ABSTRACT



A storage control enables data on various storage media to
be shared among host computers having different host
computer input/output interfaces. When a host computer (HCP)
requests a write, a control processor checks a host computer
interface management table. If the data format of the HCP is
the FBA format, the control processor writes the write data
into a cache slot of the cache memory without converting the
format; if the data format of the HCP is the CKD format, it
converts the data into the FBA format before writing it. When
the HCP requests a read, the processor checks the table. If
the data format of the HCP is FBA, the processor transfers
the data read from the cache slot without converting it; if
it is CKD, the processor converts the data into the CKD
format and transfers the converted data. The control
processors retrieve write data from the cache memory and
write it to a drive.

19 Claims, 9 Drawing Sheets 
STORAGE CONTROL AND COMPUTER
SYSTEM USING THE SAME 

BACKGROUND OF THE INVENTION 

The present invention relates to a storage control, and in
particular to a storage control system which enables data on
various storage media for storing input/output data to/from
host computers to be shared among said host computers
having various host computer input/output interfaces, and to
a computer system using the same.

Recently, there has been an increase in cases in which main
frames are linked with open-system-based division systems,
such as downsizing part of the operations (transactions, jobs)
which used to be processed by a main frame to a division
server (for example, a UNIX server) or incorporating an
information system into a division.

In these cases, because the data format (CKD format) of the
main frame is different from the data format (FBA format) of
the host computer input/output interface of the UNIX server,
either data conversion programs must be developed and data
must be converted between host computers, or a storage
control devoted to each host computer input/output interface
is necessary. This makes it difficult to build a wide range
of computer system configurations.

As one of the methods which have been devised in order to
overcome the above-mentioned problems, an integrated
computer system which enables various programs to be
executed without limiting the CPU (central processing unit)
architecture by adopting, for example, a hardware
configuration including a plurality of computers in one
system is disclosed in JP-A-60-254270.

A magnetic disk device including an interface which is
compatible with a plurality of different interface standards
and an interface conversion control circuit which enables
files on a magnetic disk to be shared is disclosed in
JP-A-1-309117.

SUMMARY OF THE INVENTION 

In the prior art technology which is disclosed in the
above-mentioned JP-A-60-254270, CPUs having different
architectures have a master-slave relationship with each
other. Because the CPU on the slave side is selected by a
hardware switch and occupies a system bus to execute
input/output operations to/from the storage medium, the other
slave-side CPU having a different architecture is prevented
from using the storage medium at the same time. Accordingly,
when the selected CPU is used for an extended period of time,
a disadvantage occurs in that the storage medium and/or the
system bus, which are resources common to the system, are
occupied for an extended period of time.

Since files in the magnetic disk device are divided and
stored for each of the CPUs having different architectures,
it is impossible to share the same file in the magnetic disk
device among CPUs having different architectures.

Although it is possible for host computers having different
architectures to share a storage medium in the prior art as
mentioned above, the storage medium is used exclusively by
one of the host computers having different architectures at
a time. This may cause the utilization efficiency of the file
subsystem to be remarkably lowered. The disadvantage that
data cannot be shared among host computers having different
host computer input/output interfaces is not overcome.

Although file sharing among host devices having different
interfaces is possible in the above-mentioned JP-A-1-309117,
data sharing among different interfaces is not mentioned.



It is an object of the present invention to provide a storage 
control in which a request of data access to a storage 
medium from host computers having different host computer 
input/output interfaces is made possible by conducting data 
conversion if it is necessary for such a request and in which 
sharing of data on the storage medium among host comput- 
ers having different host computer input/output interfaces is 
made possible whereby extensibility of file subsystem and 
responsiveness of data is enhanced to enable a wide range of
computer system configurations to be built. 

It is another object of the present invention to provide a 
storage control which enables various host computer input/ 
output interfaces and/or various storage medium input/ 
output interfaces to be added to or removed from the storage 
control. 

It is a further object of the present invention to provide a 
computer system including the above-mentioned storage 
control. 

In order to accomplish the above-mentioned object, in an 
aspect of the present invention there is provided a storage 
control in a computer system including a plurality of host 
computers having various different host computer input/ 
output interfaces, the storage control for controlling input/ 
output to/from the host computers, and various storage 
media for storing input/output data of the host computers, 
wherein the storage control is made up of: control 
processors, each being connected to one of host com- 
puters; device input/output interfaces, each for one of 
storage media, for connecting the control processors 
with the various storage media to input/output data in 
a predetermined format to/from each storage medium; 
and a host interface management table for managing 
the data format of the host computer interface of each 
host computer; 
each of the control processors includes a data format
converting unit which compares the data format of the
corresponding host computer in the host computer
interface managing table with the predetermined data
format when a write data request is issued from the
corresponding host computer, converts the data format
of the write data into the predetermined data format and
writes the converted write data in the storage medium
when the compared formats do not match with each other,
and writes the write data without converting the data
format of the write data when they match with each
other, and which also compares the data format of the
corresponding host computer in the host computer
interface managing table with the predetermined data
format when a read data request is issued from the
corresponding host computer, converts the data format
of the read data read from the storage medium into the
predetermined data format to transfer the converted
read data to the corresponding host computer when the
compared formats do not match with each other and
transfers the read data to the corresponding host
computer without converting the data format of the read
data when they match with each other. This
configuration enables data on various storage media to be
shared among host computers having different host
computer input/output interfaces.
In a computer system having the storage control of the 
above-mentioned configuration, addition or removal of one 
or more host computers having desired kinds of host com- 
puter input/output interfaces and one or more control pro- 
cessors compatible to the host computers is made possible 
by updating the host computer interface managing table. 
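
The table lookup described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name HostInterfaceTable, its methods, the processor identifiers "cp_mainframe" and "cp_unix", and the choice of FBA as the predetermined format are all assumptions made for illustration.

```python
from enum import Enum


class DataFormat(Enum):
    CKD = "CKD"  # format used by the main frame
    FBA = "FBA"  # format used by the UNIX server


# Assumption: the predetermined format on the storage medium is FBA.
PREDETERMINED_FORMAT = DataFormat.FBA


class HostInterfaceTable:
    """Hypothetical stand-in for the host computer interface managing table."""

    def __init__(self):
        self._formats = {}  # control processor id -> DataFormat

    def register(self, processor_id, data_format):
        """Add a host computer / control processor pair."""
        self._formats[processor_id] = data_format

    def unregister(self, processor_id):
        """Remove a host computer / control processor pair."""
        self._formats.pop(processor_id, None)

    def needs_conversion(self, processor_id):
        """Conversion is needed only when the host format differs."""
        return self._formats[processor_id] is not PREDETERMINED_FORMAT


table = HostInterfaceTable()
table.register("cp_mainframe", DataFormat.CKD)  # hypothetical identifiers
table.register("cp_unix", DataFormat.FBA)
```

In this sketch, adding or removing a host computer only touches the table, which mirrors the statement that reconfiguration is made possible by updating the managing table.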
In accordance with one feature of the present invention,
drive control blocks (DCB), each for one of storage devices
(drives), are provided for controlling the access to each
storage device. This enables various storage devices having
a given device input/output interface to be added or
removed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram showing one
embodiment of a computer system of the present invention;

FIG. 2 is a schematic block diagram which is useful for
explaining the summary of the present invention;

FIG. 3 is a schematic block diagram showing the
configuration of another embodiment of a disk subsystem of the
present invention;

FIG. 4 is a diagram showing the configuration of a drive
control block;

FIG. 5 is a diagram showing one example of host computer
interface management information;

FIG. 6 is a flow chart showing a data access operation;

FIG. 7 is a flow chart showing a DCB reserving operation;

FIG. 8 is a flow chart showing a DCB releasing operation;

FIG. 9 is a chart useful for illustrating the conversion
between CKD and FBA formats; and

FIG. 10 is a schematic block diagram showing the
configuration of a further embodiment of a disk subsystem of
the present invention.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS

Now, the embodiments of the present invention will be
described in detail with reference to drawings.

FIG. 1 is a block diagram which is useful for explaining
the principle of the operation of the present invention and
showing the configuration of an embodiment of a computer
system of the present invention which comprises a plurality
of host computers having various different input/output
interfaces, a storage control for controlling the input/output
to/from the host computers and various storage media for
storing therein input/output data to/from the host computers.

In FIG. 1, the plurality of host computers A100, B101 and
C102 having various different host computer input/output
interfaces are connected to a magnetic disk device 111,
magnetic tape device 112 and floppy disk device 113 via a
storage control 103.

Control processors 104, 105, 106, 108, 109 and 110 which
are incorporated in the storage control 103 are adapted to
control the transfer of data among the host computers A100,
B101 and C102 and the magnetic disk device 111, magnetic
tape device 112 and the floppy disk device 113.

In other words, the control processors 104, 105 and 106
execute an input/output data transfer request from the host
computers A100, B101 and C102 and the control processors
108, 109 and 110 execute a request of input/output data
transfer to the magnetic disk device 111, magnetic tape
device 112 and the floppy disk device 113.

All the control processors 104, 105, 106, 108, 109 and 110
receive and transmit data and control signals with each other
via a control line 107.

FIG. 2 is a block diagram for explaining the outline of the
present invention. Now, the outline of the present invention
will be described with reference to FIG. 2.

In FIG. 2, the host computers A200, B201 having different
host computer input/output interfaces and the storage
medium 203 are connected to the storage control 210, which
includes a data access unit 202. Access requests 204
(write), 205 (read), 206 (write) and 207 (read) which are
generated by the host computers A200 and B201 to the
storage medium 203 are issued to the storage control 210.
These access requests are executed by the data access unit
202.

If the write access request 204 is issued from the host
computer A200, the data access unit 202 performs data
conversion of the write data when it is necessary and writes
the converted data on the storage medium 203, and writes
unconverted write data on the storage medium 203 when the
data conversion is not necessary (208).

If the read access request 205 from the host computer
A200 is issued, the data access unit 202 reads data from the
storage medium and performs data conversion when it is
necessary and transfers the converted data to the host
computer A, and transfers the unconverted read data to the
host computer A when the data conversion is not necessary.

Processing for the access request from the host computer
B201 is conducted similarly to the processing for the access
request from the host computer A200.

The above-mentioned operation enables the host computers
having various different host computer input/output
interfaces to share the data on the storage medium.

It is to be noted that the number of the host computers
having different host computer input/output interfaces is not
limited to two; three or more host computers may be
connected.

FIG. 3 is a block diagram showing the configuration of a
disk subsystem having a cache memory in another embodiment
of the present invention.

In FIG. 3, a disk control 302 is connected to a host
computer 300 via a channel control 301 on the host side and
is also connected to a host computer 303 via a small
computer system interface (abbreviated as SCSI).

In the present embodiment, the host computer 300 is a
main frame computer (CKD data format) and the host
computer 303 is a UNIX computer (FBA data format).

The disk control 302 is connected to drives 315 and 316
which are magnetic storage media on the lower side.

The disk control 302 performs read/write of data on the
drives 315 and 316 in response to the requests of the host
computers 300 and 303.

Data transfer among the host computers 300, 303 and the
drives 315, 316 is controlled by the control processors 305,
306, 310 and 311 which are incorporated in the disk control
302.

The control processors 305 and 306 are connected to the
host computers 300 and 303 through the channel control 301
and the SCSI bus control 304, respectively, and the control
processors 310 and 311 are connected to the drives 315 and
316 through the drive interfaces 313 and 314, respectively.

The control processors 305 and 306 mainly perform data
transfer between the host computers 300 and 303 and the
cache memory 309. The control processors 310 and 311
mainly perform data transfer between the cache memory 309
and the drives 315 and 316.

A common control memory 307 is a common memory
which is accessible from all control processors 305, 306, 310
and 311 and stores therein common control information 318
used for allowing the disk control 302 to manage the drives
315 and 316. The common control information 318 will be
described hereafter in detail.

The cache memory 309 is accessible from all control
processors 305, 306, 310 and 311 and is used for temporarily
storing data which is read from the drives 315 and 316. A
cache slot 312 is the unit of data management in the
cache memory 309.

The control processors 305, 306, 310 and 311 receive/
transmit data and control signals from/to the cache memory
309 and the common control memory 307 via a signal line.

The control processors 305 and 306 are connected to a
service processor 317.

When update of the common control information 318 in
the common control memory 307 is instructed from the
service processor 317, the service processor 317 selects any
one of the control processors 305 and 306 to send an update
request, so that the selected control processor will update the
common control information 318 in the common control
memory 307.

Now, the common control information will be described.

The common control information 318 includes a drive
control block 400 and host computer interface management
information table 500, which will be described in order.

FIG. 4 shows a drive control block (abbreviated as DCB)
400. One DCB 400 is provided for each one of the drives and
stores therein four items of data.

The four items include a drive number 401 for enabling the
disk control 302 to identify each drive, interprocessor
exclusive information 402, interhost exclusive information
403 and drive vacancy waiting information 404.

The interprocessor exclusive information 402 is used
when the control processor 305 or 306 exclusively controls
the DCB access from the other control processor. The
processor number is set when the control processor reserves
the access right for the DCB having the specified drive number
and is canceled when the control processor releases the
access right to the DCB.

The interhost exclusive information 403 is used when the
host computer 300 or 303 exclusively controls the drive
access from the other host computer. The information 403 is
set "on" when the processor reserves access of the specified
drive number and is set "off" when the host computer
releases the access right.

The drive vacancy waiting information 404 is used to
inform one of the host computers 300 and 303 that a DCB
having the drive number specified is vacant upon being
released from use by the other host computer which has been
using the DCB, when the one computer requested to reserve
the access right for the DCB but was informed of the fact
that the DCB was being used by the other host computer.

FIG. 5 shows a host interface management information
table 500.

In the table 500, the host interface information 502 is
managed for each number 501 of the control processor of a
host computer connected. It is determined based upon this
information whether data conversion is to be conducted.

In the present embodiment, the control processors 305
(503 in FIG. 5) and 306 (504 in FIG. 5) manage the CKD
and FBA formats 506 and 507, respectively. The present
management information table 500 is set and reset in
response to an instruction from the service processor 317.

Now, operation of the control processors 305, 306, 310
and 311 in the disk control 302 in accordance with the
present invention will be described.

FIG. 6 is a flow chart showing the main flow of operation
which is executed by a data access processing unit (600)
including a data format conversion unit.

When a control processor 305 receives a data access
command from the host computer 300, DCB processing for
reserving the access right for the DCB of the specified drive
number is executed (step 601).

A determination as to whether reservation of the DCB has
succeeded or not is made (step 602). If it failed, the data
access processing is ended (step 616). If it is successful, the
following operation will be conducted.

Reservation of the cache slot is conducted (step 603).
Subsequently, a determination is made as to whether the
data access command is a write command or a read
command (step 604).

If it is a write command, operation will be conducted as
follows:

With reference to the host computer interface management
information table 500 (FIG. 5) (step 610), a determination
is made as to whether the host computer interface
which is connected to the control processor is in the CKD or
FBA format (step 611).

In the present embodiment, the control processors 305
and 306 are in the CKD and FBA formats 506 and 507,
respectively.

If it is in the CKD format, the control processor 305
converts the write data from the CKD format into the FBA
format and writes the converted write data into the cache
slot 312 (step 613). If it is in the FBA format, the control
processor 305 writes FBA data into the cache slot 312
without conducting any conversion. Conversion of FBA data
into CKD data will be described hereafter. Thereafter, the
cache slot is released (step 614) and the DCB is released
(step 615).

If the command received is a read command, operation
will be executed as follows:

Firstly, a determination is made as to whether the read data
exists in the cache memory (step 617). If the data exists,
operation at step 605 and the subsequent steps will be
conducted. If no data exists, operation (1) will be conducted.

In the operation (1), the control processors 305 and 306
instruct the control processors 310, 311 to read data from the
drives and the control processors 310, 311 read data from the
drives via the drive interfaces 314 to write the read data
in the cache slot of the cache memory. This operation is
omitted in the flow chart of FIG. 6.

Now, operation at step 605 and the subsequent steps will
be described.

Data on the cache slot 312 is read (step 605).

With reference to the host computer interface management
information table 500 (FIG. 5) (step 606), a determination
is made as to whether the host computer interface
which is connected to the control processor is in the CKD or
FBA format (step 607).

If it is in the CKD format, the control processor 305
converts the read data from FBA format into CKD format
data (step 608) and transfers the converted data to the host
computer 300 (step 609). Conversion of FBA data into CKD
data will be described later.

If it is in the FBA format, the data which is read out from
the cache slot 312 is transferred to the host computer 303
without conducting any conversion.
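
The data access flow of FIG. 6 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the DiskControl class and its fields are hypothetical, the DCB reservation is reduced to a set of reserved drive numbers, and ckd_to_fba and fba_to_ckd are toy placeholders, since the specification does not give the conversion method.

```python
from dataclasses import dataclass, field

CKD, FBA = "CKD", "FBA"


def ckd_to_fba(record):
    """Placeholder conversion (assumption): keep only the data payload."""
    return record["data"]


def fba_to_ckd(payload):
    """Placeholder conversion (assumption): rebuild a count field."""
    return {"count": len(payload), "data": payload}


@dataclass
class DiskControl:
    host_formats: dict                          # processor number -> CKD or FBA
    drives: dict = field(default_factory=dict)  # (drive, address) -> FBA payload
    cache: dict = field(default_factory=dict)   # cache slots, always FBA
    reserved: set = field(default_factory=set)  # drives whose DCB is in use

    def access(self, processor, command, drive, address, data=None):
        if drive in self.reserved:              # steps 601-602: reservation fails
            return None                         # step 616
        self.reserved.add(drive)
        try:
            fmt = self.host_formats[processor]  # steps 606 / 610: table lookup
            key = (drive, address)
            if command == "write":              # step 604
                self.cache[key] = ckd_to_fba(data) if fmt == CKD else data
                return True
            if key not in self.cache:           # step 617, then operation (1)
                self.cache[key] = self.drives[key]
            payload = self.cache[key]           # step 605
            return fba_to_ckd(payload) if fmt == CKD else payload
        finally:
            self.reserved.discard(drive)        # steps 614-615: release


disk = DiskControl(host_formats={305: CKD, 306: FBA})
disk.access(305, "write", 0, 7, {"count": 3, "data": b"abc"})
shared = disk.access(306, "read", 0, 7)  # the FBA host reads what the CKD host wrote
```

Because the cache slot always holds FBA data in this sketch, a CKD host and an FBA host can read the same stored payload, each in its own format.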
Subsequently, the cache slot 312 is released (step 614) and
the DCB is released (step 615).

The write data on the cache slot 312 is written into the
drives 315, 316 by the control processors 310, 311
asynchronously with the operation of the host computers 300,
303. In other words, the control processors 310, 311 search
the cache memory and, when data to be written into the drive
is found in the cache memory, write the data into the
drive.

FIG. 7 is a flow chart showing the flow (700) of operation
in the DCB reservation operation (601) in FIG. 6.

Interprocessor exclusive information 402 of the DCB 400
corresponding to the specified drive number 401 is preset
(step 701).

Subsequently, a determination is made as to whether the
interhost exclusive information is "on" or not (step 702).

If it is not "on", the interhost exclusive information 403
is set "on" (step 703) to set success of reservation in a return
code (step 704).

Then, the interprocessor exclusive information 402 is
canceled (step 708) to end the DCB reservation operation
(step 709).

If the interhost exclusive information has already been
"on", the operation will be conducted as follows:

The fact that the DCB is being used is reported to the host
computer (step 705).

Subsequently, the host computer to which the use of the DCB
is reported is recorded in the DCB vacancy waiting
information 404 of DCB 400 (step 706).

Failure of reservation is set in the return code (step 707)
and then the interprocessor exclusive information 402 is
canceled (step 708).

FIG. 8 is a flow chart showing the flow of DCB release
processing (615) in FIG. 6.

Now, the interprocessor exclusive information 402 of
DCB 400 of the drive number 401 for which release is
requested is set (step 801).

Then, the interhost exclusive information 403 is canceled
(step 802).

Then, a determination is made as to whether a host
computer exists which is registered in the vacancy waiting
information of the DCB (step 803).

If no host computer exists, the interprocessor exclusive
information 402 is canceled (step 805) to end (step 806) the
drive release operation (step 800).

If a host computer exists, vacancy of the DCB is reported to
the registered host computer (step 804).

This prevents the DCB from being predominantly used by
either one of the host computers. Then, the interprocessor
exclusive information is canceled (step 805).

FIG. 9 is a chart for explaining the conversion between
CKD and FBA format data.

Referring now to FIG. 9, conversion from CKD format
data to FBA format data will be briefly described.

Information on which block CKD of (1) and (2) are
positioned is available. Information on data length of CKD
is stored in a C (count) area. In the case of data input/output
to/from a host having CKD format, the data length
information is divided by 512 bytes to determine the number of
necessary blocks. The data is front-packed from the front
end of the positioned blocks. The remaining portion of the
block is left as it is. In the case of data input/output to/from
a host computer having FBA format, the specified LBA
(Logical Block Address) is accessed from the host computer.

Now, conversion from FBA format data to CKD format
data will be described.

In the case of data input/output to/from a host computer
having CKD format, a block having the C information of
interest is searched and accessed from the specified C
information.

In the case of data input/output to/from a host computer
having FBA format, a block of specified LBA is accessed.

Since the method of conversion is well known, its detailed
description will be omitted herein.

Although a magnetic disk device is used as a storage
medium in the above-mentioned embodiments, the
above-mentioned data access processing can be implemented by
using a magnetic tape device or floppy disk device in lieu of
the magnetic disk device.

The host computers may be added to or removed from the
storage control or the computer system for each of control
processors for processing a request of input/output of the
host computer and control processors for processing
input/output to/from the storage medium. Further, host
computers may be added to or removed from the storage control
of a drive (storage) having a device input/output interface.
Such an embodiment will be described with reference to
FIG. 10. Since components like those in FIG. 3 are
designated by like reference numerals, description of them
will be omitted.

In FIG. 10, a pair (319) of a fiber channel control 317 and
a control processor 318 for controlling the same are
connected (added) to or removed (deleted) from a common bus
308 of the disk control 302.

In FIG. 10, a pair of a floppy disk (FD) 321 and a
control processor 320 for controlling the same are connected
(added) to or removed from the common bus 308 of the disk
control 302.

A service processor 317 updates the contents of the
drive control block 318 in a common memory 307 and the
host computer interface management table 319 in response
to addition or deletion of the pairs 319 and 322.

In accordance with the present invention, data on storage
media can be shared by host computers in which storage
controls have various different host computer input/output
interfaces as mentioned in the foregoing embodiments.
Extensibility of the file subsystem and the responsiveness of
data are enhanced.

Since host computers having various different host
computer input/output interfaces or various storage media for
storing input/output data of the host computers can be
connected by a single storage control, a wide range of
configurations of computer system are possible.

What is claimed is:

1. A storage control for use in a computer system
including a plurality of host computers having different kinds of
computer input/output interfaces; a storage control for
controlling data input/output to/from said host computers; and at
least one storage medium having a device input/output
interface for conducting data input/output in a given data
format for storing input/output data to/from said host
computers,

said storage control comprising:

a plurality of control processors for controlling the
transfer of input/output data to/from the host
computers, each of said processors being connected
to one of said plurality of host computers; and

a management table for managing the data format of
the input/output interface of each of said host
computers;

each of said control processors including a data format
converting unit which, in response to a request of
read or write of associated host computer, converts
the format of the read or write data into said given
data format when the data format from the associated
host computer does not match said given data format
and does not convert the format of the read or write
data when they match.

2. A storage control as defined in claim 1 in which the
content in said management table is set/canceled in response
to another processor to allow a desired number of host
computers and associated control processors to be added or
removed.

3. A storage control as defined in claim 2 and further
including a drive control block for managing the state of
access to said storage media by said host computers.

4. A storage control as defined in claim 3 in which the
number of said storage media is arbitrarily changed to allow
a desired number of storage media to be added or removed.

5. A storage control as defined in claim 1 and further
including a cache memory which is connected to said
storage media so that input/output of data is conducted by
said host computers via the cache memory.

6. A computer system comprising:

a plurality of host computers having different kinds of
computer input/output interfaces; a storage control for
controlling data input/output to/from said host
computers; and

at least one storage medium having a device input/output
interface for conducting data input/output in a given
data format for storing input/output data to/from said
host computers,

said storage control including:

a plurality of control processors for controlling the
transfer of input/output data to/from the host
computers, each of said processors being connected
to one of said plurality of host computers; and

a management table for managing the data format of
the input/output interface of each of said host
computers;

each of said control processors including a data format
converting unit which, in response to a request of
read or write from associated host computer,
converts the format of the read or write data into said
given data format when the data format of the
associated host computer does not match said given
data format and does not convert the format of the
read or write data when they match.

7. A computer system as defined in claim 6 in which the
content in said management table is set/canceled in response

9. A computer system as defined in claim 8 in which the
number of said storage media is arbitrarily changed to allow
a desired number of storage media to be added or removed.

10. A computer system as defined in claim 6 and further
including a cache memory which is connected to said
storage media so that input/output of data is conducted by
said host computers via the cache memory.

11. A storage control apparatus for controlling transfer of
data between a plurality of host computers and at least one
storage medium which stores data of a particular data format,
said particular data format being the same as a data format
used by at least one of said host computers, said storage
control apparatus comprising:

a plurality of control processors each controlling transfer
of data between one of the host computers and the
storage medium,

wherein each control processor includes a data format
converting unit which, in response to a request to write
data in the storage medium from a corresponding host
computer, converts the data format of the write data
into the particular data format when the data format of
the write data does not match the particular data format
and does not convert the data format of the write data
into the particular data format when the data format of
the write data matches the particular data format.

12. A storage control apparatus according to claim 11
wherein each data format converting unit of each control
processor, in response to a request to read data in the storage
medium from a corresponding host computer, converts the
data format of the read data into a data format other than the
particular data format used by the corresponding host
computer when the data format used by the corresponding host
computer does not match the particular data format and does
not convert the data format of the read data into the
particular data format when the data format used by the
corresponding host computer matches the particular data
format.

13. A storage control apparatus according to claim 11
further comprising:

a management table for managing information indicating
the data format used by each of the host computers,
wherein the data format of a host computer is determined
by referring to said information.

14. A storage control apparatus according to claim 13,
wherein said information stored in said management table is
set/cancelled in response to addition or removal of a host
computer and a corresponding control processor.

15. A storage control apparatus according to claim 14,
further comprising:

a drive control block for managing the state of access to
said storage media by said host computers.

16. A storage control apparatus according to claim 15,
wherein the number of said storage media is arbitrarily
changed to add or remove storage media as desired

to another processor to allow a desired number of host i-j A * ,i j- ■ ■ « * , 

computers and associated control processors to be added or 11 ' . A St ° rage COntro1 accordin S t0 claim U - further 

removed. comprising: 

8. A computer system as defined in claim 7 and further 65 a cache memory which is connected to said storage media 
including a drive control block for managing the state of so that input/output of data is conducted by said host 
access to said storage media by said host computers. computers via the cache memory. 
18. A method of controlling transfer of data between a
plurality of host computers and at least one storage medium 
which stores data of a particular data format, said particular 
data format being the same as a data format used by at least 
one of said host computers, said method comprising the 
steps of: 

controlling transfer of data between one of the host 
computers and the storage medium; and 

in response to a request to write data in the storage 
medium from a corresponding host computer, convert- 
ing the data format of the write data into the particular 
data format when the data format of the write data does 
not match the particular data format and not converting 
the data format of the write data into the particular data
format when the data format of the write data matches
the particular data format. 
19. A method according to claim 18 further comprising 
the step of: 

in response to a request to read data in the storage medium 
from a corresponding host computer, converting the 
data format of the read data into a data format other 
than the particular data format used by the correspond- 
ing host computer when the data format used by the 
host computer does not match the particular data format 
and not converting the data format of the read data into 
the particular data format when the data format used by 
the host computer matches the particular data format. 
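The write and read steps of claims 18 and 19 can be illustrated with a minimal sketch. It assumes, following the abstract, that the particular data format of the storage medium is FBA and that hosts use either FBA or CKD. The names `StorageControl`, `Record`, `convert` and `management_table` are hypothetical and are not taken from the patent; `convert` is only a stand-in that relabels the format, whereas a real converter would re-block the count, key and data fields.

```python
from dataclasses import dataclass
from enum import Enum


class DataFormat(Enum):
    FBA = "FBA"  # fixed-block format (open-system hosts)
    CKD = "CKD"  # count-key-data format (mainframe hosts)


# Assumption: the "particular data format" of the storage medium is FBA.
STORAGE_FORMAT = DataFormat.FBA


@dataclass(frozen=True)
class Record:
    fmt: DataFormat
    payload: bytes


def convert(record: Record, target: DataFormat) -> Record:
    """Hypothetical stand-in for a real converter: only relabels the format."""
    return Record(target, record.payload)


class StorageControl:
    def __init__(self) -> None:
        # Management table: host identifier -> data format of that host.
        self.management_table: dict[str, DataFormat] = {}
        # Storage medium modelled as address -> record in STORAGE_FORMAT.
        self.medium: dict[int, Record] = {}
        self.conversions = 0  # counts conversions, for illustration only

    def add_host(self, host_id: str, fmt: DataFormat) -> None:
        self.management_table[host_id] = fmt

    def remove_host(self, host_id: str) -> None:
        self.management_table.pop(host_id, None)

    def write(self, host_id: str, address: int, payload: bytes) -> None:
        record = Record(self.management_table[host_id], payload)
        # Convert only when the write data is not already in the storage format.
        if record.fmt is not STORAGE_FORMAT:
            record = convert(record, STORAGE_FORMAT)
            self.conversions += 1
        self.medium[address] = record

    def read(self, host_id: str, address: int) -> Record:
        host_fmt = self.management_table[host_id]
        record = self.medium[address]
        # Convert only when the requesting host uses a different format.
        if host_fmt is not STORAGE_FORMAT:
            record = convert(record, host_fmt)
            self.conversions += 1
        return record
```

In this sketch a write from a CKD host triggers one conversion and a write from an FBA host triggers none; the same data can then be read back by either host, with conversion applied only for the CKD host.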
DISK ARRAY SYSTEM

CROSS REFERENCE TO RELATED APPLICATIONS

This is a continuation of application Ser. No. 07/601,482 filed Oct. 22, 1990 which is a continuation-in-part of Ser. Nos. 07/505,622, 07/506,703, and 07/488,749, filed Apr. 6, 1990, Apr. 6, 1990, and Mar. 2, 1990, respectively.

BACKGROUND OF THE INVENTION

The present invention relates generally to memory storage devices. More particularly, the invention is a method and apparatus for interfacing an external computer to a set of storage devices which are typically disk drives.

Magnetic disk drive memories for use with digital computer systems are known. Although many types of [illegible] described as using hard disk drives. However, nothing herein should be taken to limit the invention to that particular embodiment.

Many computer systems use a plurality of disk drive memories to store data. A common known architecture for such systems is shown in FIG. 1. Therein, computer 10 is coupled by means of bus 15 to disk array 20. Disk array 20 is comprised of large buffer 22, bus 24, and a plurality of disk drives 30. The disk drives 30 can be operated in various logical configurations. When a group of drives is operated collectively as a logical device, data stored during a write operation can be spread across one or more members of an array. Disk controllers 35 are connected to buffer 22 by bus 24. Each controller 35 is assigned a particular disk drive 30.

Each disk drive within disk drive array 20 is accessed and the data thereon retrieved individually. The disk controller 35 associated with each disk drive 30 controls the input/output operations for the particular disk drive to which it is coupled. Data placed in buffer 22 is available for transmission to computer 10 over bus 15. When the computer transmits data to be written on the disks, controllers 35 receive the data for the individual disk drives from bus 24. In this type of system, disk operations are asynchronous in relationship to each other.

In the case where one of the controllers experiences a failure, the computer must take action to isolate the failed controller and to switch the memory devices formerly under the failed controller's control to a properly functioning other controller. The switching requires the computer to perform a number of operations. First, it must isolate the failed controller. This means that all data flow directed to the failed controller must be redirected to a working controller.

In the system described above, it is necessary for the computer to be involved with rerouting data away from a failed controller. The necessary operations performed by the computer in completing rerouting require the computer's attention. This places added functions on the computer which may delay other functions which the computer is working on. As a result, the entire system is slowed down.

Another problem associated with disk operations, in particular writing and reading, is an associated probability of error. Procedures and apparatus have been developed which can detect and, in some cases, correct the errors which occur during the reading and writing of the disks. With relation to a generic disk drive, the disk is divided into a plurality of sectors, each sector having the same, predetermined size. Each sector has a particular [illegible] address, a header field code which allows for the detection of errors in the header field, a data field of variable length and ECC ("Error Correction Code") codes, which allow for the detection and correction of errors in the data.

When a disk is written to, the disk controller reads the header field and the header field code. If the sector [illegible] and no header field error is detected, the new data is written into the data field and the new data ECC is written into the ECC field.

Read operations are similar in that initially both the header field and header field error code are read. If no header field errors exist, the data and the data correction codes are read. If no error is detected the data is transmitted to the computer. If errors are detected, the error correction circuitry located within the disk controller tries to correct the error. If this is possible, the corrected data is transmitted; otherwise, the disk drive's controller signals to the computer or master disk controller that an uncorrectable error has been detected.

[illegible] drive system which has an associated error correction circuit, external to the individual disk controllers, is shown. This system uses a Reed-Solomon error detection code both to detect and correct errors. Reed-Solomon codes are known and the information required to generate them is described in many references. One such reference is Practical Error Correction Design for Engineers, published by Data Systems Technology Corp., Broomfield, Colo. For purposes of this application, it is necessary to know that the Reed-Solomon code generates redundancy terms, herein called P and Q redundancy terms, which terms are used to detect and correct data errors.

In the system shown in FIG. 2, ECC unit 42 is coupled to bus 45. The bus is individually coupled to a plurality of data disk drives, numbered here 47, 48, and 49, as well as to the P and Q term disk drives, numbered 51 and 53, through Small Computer Standard Interfaces ("SCSIs") 54 through 58. The American National Standard for Information Processing ("ANSI") has promulgated a standard for SCSI which is described in ANSI document number X3.130-1986.

Bus 45 is additionally coupled to large output buffer 22. Buffer 22 is in turn coupled to computer 10. In this system, as blocks of data are read from the individual data disk drives, they are individually and sequentially placed on the bus and simultaneously transmitted both to the large buffer and the ECC unit. The P and Q terms from disk drives 51 and 53 are transmitted to ECC 42 only. The transmission of data and the P and Q terms over bus 45 occurs sequentially. The exact bus width can be any arbitrary size with 8-, 16- and 32-bit wide buses being common.

After a large block of data is assembled in the buffer, the calculations necessary to detect and correct data errors, which use the terms received from the P and Q disk drives, are performed within the ECC unit 42. If errors are detected, the transfer of data to the computer is interrupted and the incorrect data is corrected, if possible.

During write operations, after a block of data is assembled in buffer 22, new P and Q terms are generated within ECC unit 42 and written to the P and Q disk
drives at the same time that the data in buffer 22 is written to the data disk drives.

Those disk drive systems which utilize known error correction techniques have several shortcomings. In the systems illustrated in FIGS. 1 and 2, data transmission is sequential over a single bus with a relatively slow rate of data transfer. Additionally, as the error correction circuitry must wait until a block of data of predefined size is assembled in the buffer before it can detect and correct errors therein, there is an unavoidable delay while such detection and correction takes place.

As stated, the most common form of data transmission in these systems is serial data transmission. Given that the bus has a fixed width, it takes a fixed and relatively large amount of time to build up data in the buffer for transmission either to the disks or computer. If the large, single buffer fails, all the disk drives coupled thereto become unusable. Therefore, a system which has a plurality of disk drives which can increase the rate of data transfer between the computer and the disk drives and more effectively match the data transfer rate to the computer's maximum efficient operating speed is desirable. The system should also be able to conduct this high rate of data transfer while performing all necessary error detection and correction functions and at the same time provide an acceptable level of performance even when individual disk drives fail.

Another failing of prior art systems is that they do not exploit the full range of data organizations that are possible in a system using a group of disk drive arrays. In other words, a mass storage apparatus made up of a plurality of physical storage devices may be called upon to operate as a logical storage device for two concurrently-running applications having different data storage needs. For example, one application may require large data transfers (i.e., high bandwidth), and the other may require high frequency transfers (i.e., high operation rate). A third application may call upon the apparatus to provide both high bandwidth and high operating rate. Known operating techniques for physical device sets do not provide the capability of dynamically configuring a single set of physical storage devices to provide optimal service in response to such varied needs.

It would therefore be desirable to be able to provide a mass storage apparatus, made up of a plurality of physical storage devices, which could flexibly provide both high bandwidth and high operation rate, as necessary, along with high reliability.

SUMMARY OF THE INVENTION

The present invention provides a set of small, inexpensive disk drives that appears to an external computer as one or more logical disk drives. The disk drives are arranged in sets. Data is broken up and written across the disk drives in a set, with error detection and correction redundancy data being generated in the process and also being written to a redundancy area. Backup disk drives are provided which can be coupled into one or more of the sets. Multiple control systems for the sets are used, with any one set having a primary control system and another control system which acts as its backup, but primarily controls a separate set. The error correction or redundancy data and the error detection data is generated "on the fly" as data is transferred to the disk drives. When data is read from the disk drives, the error detection data is verified to confirm the integrity of the data. Lost data from a particular disk drive can be regenerated with the use of the redundancy data and a backup drive can be substituted for a failed disk drive.

The present invention provides an arrangement of disk drive controllers, data disk drives and error correction code disk drives, the drives being each individually coupled to a small buffer memory and a circuit for error detection and correction. A first aspect of the present invention is error detection and correction which occurs nearly simultaneously with the transfer of data to and from the disk drives. The multiple buffer memories can then be read from or written to in sequence for transfers on a data bus to the system computer. Additionally, the error correction circuitry can be connected to all of the buffer memory/disk drive data paths through a series of multiplexer circuits called cross-bar ("X-bar") switches. These X-bar switches can be used to decouple failed buffer memories or disk drives from the system.

A number of disk drives are operatively interconnected so as to function at a first logical level as one or more logical redundancy groups. A logical redundancy group is a set of disk drives which share redundancy data. The width, depth and redundancy type (e.g., mirrored data or check data) of each logical redundancy group, and the location of redundant information therein, are independently configurable to meet desired capacity and reliability requirements. At a second logical level, blocks of mass storage data are grouped into one or more logical data groups. A logical redundancy group may be divided into more than one such data group. The width, depth, addressing sequence and arrangement of data blocks in each logical data group are independently configurable to divide the mass data storage apparatus into multiple logical mass storage areas each having potentially different bandwidth and operation rate characteristics.

A third logical level, for interacting with application software of a host computer operating system, is also provided. The application level superimposes logical application units on the data groups to allow data groups, alone or in combination from one or more redundancy groups, to appear to application software as single logical storage units.

As data is written to the drives, the error correction circuit, herein called the Array Correction Circuit ("ACC"), calculates P and Q redundancy terms and stores them on two designated P and Q disk drives through the X-bar switches. In contrast to the discussed prior art, the present invention's ACC can detect and correct errors across an entire set of disk drives simultaneously, hence the use of the term "Array Correction Circuit." In the following description, the term ACC will refer only to the circuit which performs the necessary error correction functions. The codes themselves will be referred to as Error Correction Code or "ECC". On subsequent read operations, the ACC may compare the data read with the stored P and Q values to determine if the data is error-free.

The X-bar switches have several internal registers. As data is transmitted to and from the data disk drives, it must go through an X-bar switch. Within the X-bar switch the data can be clocked from one register to the next before going to the buffer or the disk drive. The time it takes to clock the data through the X-bar internal registers is sufficient to allow the ACC to calculate and perform its error correction tasks. During a write operation, this arrangement allows the P and Q values to be generated and written to their designated disk drives at
the same time as the data is written to its disk drives, the [illegible] occurring in parallel. In effect the X-bar switches establish a data pipeline of several stages, the plurality of stages effectively providing a time delay.

In one preferred embodiment, two ACC units are provided. Both ACCs can be used simultaneously on two operations that access different disk drives or one can be used if the other fails.

The X-bar switch arrangement also provides flexibility in the data paths. Under control of the system controller, a malfunctioning disk drive can be decoupled from the system by reconfiguring the appropriate X-bar switch or switches and the data that was to be stored on the failed disk can be rerouted to another data disk drive. As the system computer is not involved in the detection or correction of data errors, or in reconfiguring the system in the case of failed drives or buffers, these processes are said to be transparent to the system computer.

In a first embodiment of the present invention, a plurality of X-bar switches are coupled to a plurality of disk drives and buffers, each X-bar switch having at least one data path to each buffer and each disk drive. In operation a failure of any buffer or disk drive may be compensated for by rerouting the data flow through an X-bar switch to any operational drive or buffer. In this embodiment full performance can be maintained when disk drives fail.

In another embodiment of the present invention, two ACC circuits are provided. In certain operating modes, such as when all the disk drives are being written to or read from simultaneously, the two ACC circuits are redundant, each ACC acting as a back-up unit to the other. In other modes, such as when data is written to an individual disk drive, the two ACCs work in parallel, the first ACC performing a given action for a portion of the entire set of drives, while the second ACC performs a given action which is not necessarily the same for a remaining portion of the set.

In yet another embodiment, the ACC performs certain self-monitoring check operations using the P and Q redundancy terms to determine if the ACC itself is functioning properly. If these check operations fail, the ACC will indicate its failure to the control system, and it will not be used in any other operations.

In still another embodiment, the ACC unit is coupled to all the disk drives in the set and data being transmitted to or from the disk drives is simultaneously recovered by the ACC. The ACC performs either error detection or error correction upon the transmitted data in parallel with the data transmitted from the buffers and the disk drives.

The present invention provides a speed advantage over the prior art by maximizing the use of parallel paths to the disk drives. Redundancy and thus fault-tolerance is also provided by the described arrangement of the X-bar switches and ACC units.

Another aspect of the present invention is that it switches control of disk drive sets when a particular controller fails. Switching is performed in a manner transparent to the computer.

The controllers comprise a plurality of first level controllers each connected to the computer. Connected to the other side of the first level controllers is a set of second level controllers. Each first level controller can route data to any one of the second level controllers. Communication buses tie together the second level controllers and the first level controllers can also communicate between themselves. In a preferred embodiment, the system is configured such that the second level controllers are grouped in pairs. This configuration [illegible] procedures for the flow of [illegible]. For ease of understanding, the [illegible] configured with [illegible]. Of course, it should be [illegible] that the controllers could be configured in groups of three or other groupings. [illegible] to connect each of the second level controllers to a group of disk drives. In the case that a second level controller should fail, the computer need not get involved with the rerouting of [illegible] the disk drives. Instead, the first level controllers and the properly working second level controller can handle the failure without the involvement of the computer. This allows the logical configuration of the disk drives to remain constant from the perspective of the computer despite a change in the physical configuration.

There are two levels of severity of failures which can arise in the second level controllers. The first type is a complete failure. In the case of a complete failure, the second level controller stops communicating with the first level controllers and the other second level controller. The first level controllers are informed of the failure by the properly working second level controller or may recognize this failure when trying to route data to the failed second level controller. In either case, the first level controller will switch data paths from the failed second level controller to the properly functioning second level controller. Once this rerouted path has been established, the properly functioning second level controller issues a command to the malfunctioning second level controller to release control of its disk drives. The properly functioning second level controller then assumes control of these disk drive sets.

The second type of failure is a controlled failure where the failed controller can continue to communicate with the rest of the system. The partner second level controller is informed of the malfunction. The properly functioning second level controller then informs the first level controllers to switch data paths to the functioning second level controller. Next, the malfunctioning second level controller releases its control of the disk drives and the functioning second level controller assumes control. Finally, the properly functioning second level controller checks and, if necessary, corrects data written to the drives by the malfunctioning second level controller.

A further aspect of the present invention is a SCSI bus switching function which permits the second level controllers to release and assume control of the disk drives.

For a more complete understanding of the nature and the advantages of the invention, reference should be made to the ensuing detailed description taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a prior art disk array system;

FIG. 2 is a block diagram illustrating a prior art disk array system with an error check and correction block;
FIG. 3 is a diagram illustrating a preferred embodiment of the overall system of the present invention;

FIG. 3.5 is a diagram showing a pair of disk drive sets 18A and 18B connected to a pair of second level controllers 14A and 14B;

FIG. 4 is a diagram showing a more detailed illustration of FIG. 3 including the interconnections of the switches and the disk drives within the disk drive sets;

FIG. 5 is a block diagram of the wiring between the controllers and the switches;

FIG. 6 is a block diagram showing the schematic circuitry of the switching function control circuitry shown in FIG. 5;

FIG. 7 is a recovery state transition diagram illustrating the various possible states of a particular second level controller;

FIGS. 8A-8I show the events which take place during the transition between each of the states shown in FIG. 7;

FIG. 9 is a block diagram of one preferred embodiment of the X-bar circuitry;

FIG. 10 is a block diagram of a preferred embodiment of the error check and correction circuitry;

FIG. 11 is a detailed block diagram of the X-bar switches and the ACC shown in FIG. 10;

FIGS. 12a and 12b show the logic operations necessary to calculate the P and Q error detection terms;

FIGS. 13a and 13b show how the Reed-Solomon codeword is formed and stored in one embodiment of the present invention;

FIGS. 14a and 14b show the parity detector and parity generator circuits in the ACC;

FIGS. 15, 16, 17, and 18 show, respectively, the data flow during a Transaction Mode Normal Read, a Transaction Mode Failed Drive Read, a Transaction Mode Read-Modify-Write Read and a Transaction Mode Read-Modify-Write Write;

FIG. 19 is a schematic diagram of a set of disk drives in which check data is distributed among drives of the set according to a known technique;

FIG. 20 is a schematic diagram of a mass storage system suitable for use with the present invention;

FIG. 21 is a schematic diagram of the distribution of data on the surface of a magnetic disk;

FIG. 22 is a schematic diagram of the distribution of data in a first preferred embodiment of a redundancy group according to the present invention;

FIG. 23 is a schematic diagram of the distribution of data in [illegible] redundancy group according to the present invention;

FIG. 24 is a diagram showing how the memory space of a device set might be configured in accordance with the principles of the present invention; and

FIG. 25 is a diagram of an exemplary embodiment of [illegible] the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The preferred embodiments of the present invention comprise a system for mass data storage. In the preferred embodiments described herein the preferred devices for storing data are hard disk drives, referenced herein as disk drives. Nothing herein should be understood to limit this invention to using disk drives only. Any other device for storing data may be used, including, but not limited to, floppy disks, magnetic tape drives, and optical disks.

Overall System Environment

One preferred embodiment of the present invention operates in the environment shown in FIG. 3. In FIG. 3, computer 10 communicates with a group of disk drive sets 18 through controller 11. In a preferred embodiment, controller 11 includes a number of components which permit computer 10 to access each of disk drive sets 18 even when there is a failure in one of the components of controller 11. As shown in FIG. 3, controller 11 includes a pair of two-level devices 13. Within each of the two-level devices 13 is a first level controller 12 and a second level controller 14. A switch 16 which comprises a group of switches permits computer 10 to access disk drive sets 18 through more than one path. In this way, if either of two-level devices 13 experiences a failure in one of its components, the path may be re-routed without computer 10 being interrupted.

FIG. 3.5 is a diagram showing a pair of disk drive sets 18A and 18B connected to a pair of second level controllers 14A and 14B. Controllers 14A and 14B each include two interface modules 27 for interfacing second level controllers 14 with a pair of first level controllers 12 (shown in FIG. 3). Interface modules 27 are connected to buffers 330-335 which buffer data to be transmitted to and received from the disk drives. Second level controllers 14 are configured to be primarily responsible for one group of disk drives and secondarily responsible for a second group of disk drives. As shown, [illegible] 14A is primarily responsible for [illegible] drives 20A1, [illegible] 20D1 and 20E1 and secondarily responsible for [illegible] 20A2, 20B2, 20C2, [illegible] drive 20X is shared by both [illegible].

[illegible] Second level controllers 14A and 14B are shown connected to first level controllers 12A and 12B. The lines between second level controllers 14 and first level controllers 12 represent data buses through which data flows as well as control [illegible]. The [illegible] line between second level controller 14A and second level controller 14B represents a communication line through which the second level controllers communicate with each other.

Second level controllers 14 are each connected to a group of disk drive sets 18A-18F through switches 16A-16F.

Disk drives 20 are arranged in a manner so that each second level controller 14 is primarily responsible for one group of disk drive sets. As shown in FIG. 4, sec-
ond level controller 14A may be primarily responsible sense line 50A is active, the appropriate second level

for three of the disk drive sets 18A-18F. Similarly, controller will be in control at a particular time, 

second level controller 14B may be primarily responsi- FIG. 7 is a state transition diagram showing the rela- 

ble for the remaining three disk drive sets 18A-18F. tionships between the various states of CRS 22 (FIG. 3) 

Second level controllers 14 are secondarily responsible 5 of a particular second level controller 14. Each second 

for the disk drives primarily controlled by the partner level controller 14 must be in only one state at any 

second level controller. In the particular arrangement particular point in time, Initially, assuming that the 

shown in FIG. 4, second level controller 14A may be system is functioning properly and each second level 

primarily responsible for the left three disk drive sets controller 14 is primarily responsible for half of the disk 

18A, 18B and 18C and secondarily responsible for the 10 drive sets 18 and secondarily responsible for half of the 

right three disk drives sets 18D, 18E and 18F. Second disk drive sets 18, second level controller 14 is in a 

level controller 14B is primarily responsible for the PRIMARY STATE 26. While in PRIMARY STATE 

right three disk drive sets 18D, 18E and 18F and sec- 26, two major events may happen to move a second 

ondarily responsible for the left three disk sets 18A.18B level controller 14 from PRIMARY STATE 26 to 

and 18C 15 another state. The first event, is the failure of the partic- 

Each second level controller 14 contains a second ular second level controller 14. If there is a failure, 

level controller recovery system (CRS) 22. CRS 22 is a second level controller 14 shifts from PRIMARY 

portion of software code which manages the communi- STATE 26 to a NONE STATE 28. In the process of 

cation between second level controllers 14 and first doing so, it will pass through RUN-DOWN-PRIMAR- 

level controllers 12. CRS 22 is typically implemented as 20 IES-TO-NONE STATE 30. 

a state machine which is in the form of microcode or There are two types of failures which are possible in 
sequencing logic for moving second level controller 14 second level controller 14. The first type of failure is a 
from state to state (described below). State changes are controlled failure. Further, there are two types of con- 
triggered as different events occur and messages are trolled failures. 

sent between the various components of the system. 25 The first type of controlled failure is a directed con- 
An ECC block 15 is also included in each second trolled failure. This is not actually a failure but instead 
level controller 14. ECC block 15 contains circuitry* for an instruction input from an outside source instructing a 
checking and correcting errors in data which occur as particular second level controller to shut down. This 
the data is passed between various components of the instruction may be received in second level controller 
system. This circuitry is described in more detail below. 30 14 from one of the following sources; An operator, 
FIG. 5 is a block diagram showing a more detailed through computer 10; a console 19 through a port 24 
illustration of the interconnections between second (e.g. RS-232) on the first level controller; a diagnostic 
level controllers 14A and 14B and the disk drives. For console 21 through a port 23 (e.g. RS-232) on the sec- 
simplicity, only a single disk drive port is shown. More ond level controller; or by software initiated during 
disk drive pons are included in the system as shown in 35 predictive maintenance. Typically, such an instruction 
FIGS. 3 and 4. is issued in the case where diagnostic testing of a second 
Second level controller 14A has a primary control/- level controller is to be conducted. In a directed con- 
sense line 50A for controlling its primary set of disk trolled failure, the second level controller finishes up 
drives. An alternate control/sense line 52A controls an any instructions it is currently involved with and refuses 
alternate set of disk drives. Of course, second level 40 to accept any further instructions. The second level 
controller 14B has a corresponding set of control/sense controller effects a "graceful" shut down by sending 
lines. Data buses 54A (second level controller 14A) and out messages to the partner second level controller that 
54B (second level controller 14B) carry the data to and it will be shutting down. 

from disk drives 20, These data buses are typically in the The second type of controlled failure is referred to as 

form of a SCSI bus. 45 a moderate failure. In this case, the second level con- 

A set of switches 16A-16F are used to grant control troller recognizes that it has a problem and can no 

of the disk drives to a particular second level controller. longer function properly to provide services to the 

For example, in FIG. 4, second level controller 14A has system. For example, the memory or drives associated 

primary responsibility for disk drives 20A-20C and with that second level controller may have malfunc- 

alternate control of disk drives 20D-20F. Second level 50 tioned. Therefore, even if the second level controller is 

controller 14B has primary control of disk drives properly functioning, it cannot adequately provide ser- 

20D-20F and alternate control of disk drives 20A-20C. vices to the system. It aborts any current instructions, 

By changing the signals on control/sense lines 50 and refuses to accept any new instructions and sends a mes- 

52, primary and secondary control can be altered. sage to the partner second level controller that it is 

FIG. 6 is a more detailed illustration of one of the 55 shutting down. In both controlled failures, the malfunc- 

switches 16A-16F. A pair of pulse shapers 60A and 60B tioning second level controller releases the set of disk 

receive the signals from the corresponding control/- drives over which it has control. These drives are then 

sense lines 50A and 52B shown in FIG. 5. Pulse shapers taken over by the partner second level controller. 

60 clean up the signals which may have lost clarity as The second type of failure is a complete failure. In a 

they were transmitted over the lines. Pulse shapers of 60 complete failure, the second level controller becomes 

this type are well% known in the art. The clarified inoperable and cannot send messages or "clean-up" its. 

signals from pulse shapers 60 are then fed to the set and currently pending instructions by aborting them. In 

reset pins of R/S latch 62. The Q and Q outputs of latch other words, the second level controller has lost its 

62 are sent to the enable lines of a pair of driver/receiv- ability to serve the system. It is up to one of the first 

ers 64A and 64B. Driver/receivers 64A and 64B are 65 level controllers or the partner second level controller 

connected between the disk drives and second level to recognize the problem. The partner second level 

controllers 14A and 14B. Depending upon whether controller then takes control of the drives controlled by 

primary control/sense line 52B or alternate control/- the malfunctioning second level controller. The routing 
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through the malfunctioning second level controller is 
switched over to the partner second level controller. 

In all of the above failures, the switching takes place 
without interruption to the operation of the computer. 
Second level controllers 14 and first level controllers 12 
handle the rerouting independently by communicating 
the failure among themselves. 
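The three failure types described above differ mainly in what happens to pending instructions, whether the partner is notified, and who recognizes the problem. The following is a minimal sketch of that distinction, not the CRS implementation; the names `Failure`, `ShutdownPlan` and `shutdown_plan` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    DIRECTED = "directed controlled failure"
    MODERATE = "moderate controlled failure"
    COMPLETE = "complete failure"


@dataclass(frozen=True)
class ShutdownPlan:
    finish_pending: bool    # complete the instructions already in progress
    abort_pending: bool     # abort the instructions already in progress
    notifies_partner: bool  # failing controller sends a shutdown message
    recognized_by: str      # who notices that the controller is going away


def shutdown_plan(kind: Failure) -> ShutdownPlan:
    """Summarize how a second level controller behaves for each failure type."""
    if kind is Failure.DIRECTED:
        # Graceful shutdown requested from outside: finish work, then notify.
        return ShutdownPlan(True, False, True, "failing controller")
    if kind is Failure.MODERATE:
        # Controller detects its own problem: abort work, then notify.
        return ShutdownPlan(False, True, True, "failing controller")
    # Complete failure: the controller can neither clean up nor send messages.
    return ShutdownPlan(False, False, False,
                        "first level controller or partner")
```

In all three cases the outcome is the same: the partner second level controller ends up in control of the released drives.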

Assuming there was a failure in second level controller 14A, second level controller 14A moves from PRIMARY STATE 26 through a transition RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE STATE 28. At the same time, properly functioning second level controller 14B moves from PRIMARY STATE 26 to BOTH STATE 32. The basis for the change in state of each of second level controllers 14A and 14B is the failure of second level controller 14A. When a second level controller fails, it is important to switch disk drive control away from the failed second level controller. This permits computer 10 to continue to access disk drives which were formerly controlled by a particular second level controller which has failed. In the current example (FIG. 4), disk drive sets 18A-18C are switched by switching functions 16A-16C so that they are controlled by second level controller 14B. Therefore, second level controller 14B is in BOTH STATE 32 indicating that it has control of the disk drive sets 18 for both second level controllers. Second level controller 14A now controls none of the disk drives and is in NONE STATE 28. The transition state 30 determines which of several possible transition paths is used.

If second level controller 14A is in NONE STATE 28 and second level controller 14B is in BOTH STATE 32 there are a number of options for transferring control of disk drive sets 18A-18F once second level controller 14A has been repaired. First, second level controller 14A and second level controller 14B could each be shifted back to PRIMARY STATE 26. This is accomplished for drive sets 18A-18C by second level controller 14A moving from NONE STATE 28 directly to PRIMARY STATE 26 along the preempt p line. Preempt p simply stands for "preempt primary" which means that second level controller 14A preempts its primary drives or takes control of them from second level controller 14B. At the same time, second level controller 14B moves from BOTH STATE 32 through a transition RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 and then to PRIMARY STATE 26.

A second alternative is for second level controller 14A to move from NONE STATE 28 to SECONDARY STATE 36. Once in SECONDARY STATE 36, second level controller 14A is in control of its secondary disk drive sets 18D-18F. Second level controller 14B concurrently moves from BOTH STATE 32 through RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38 and on to SECONDARY STATE 36. When both second level controllers are in SECONDARY STATE 36, they are in control of their secondary disk drive sets. Second level controller 14A controls disk drive sets 18D-18F and second level controller 14B controls disk drive sets 18A-18C.

From SECONDARY STATE 36, a failing second level controller 14 may move through RUN-DOWN-SECONDARIES-TO-NONE STATE 40 to NONE STATE 28. If this occurs, the properly functioning partner second level controller 14 moves from SECONDARY STATE 36 to BOTH STATE 32 so that computer 10 could access any one of disk drive sets 18. As in the previous example, if second level controller 14A were to fail it moves from SECONDARY STATE 36 through RUN-DOWN-SECONDARIES-TO-NONE STATE 40 and into NONE STATE 28. At the same time, properly functioning second level controller 14B moves from SECONDARY STATE 36 along the preempt b/p line into BOTH STATE 32. Preempt b/p stands for "preempt both/primaries." In other words, all of the disk drives are preempted by the properly functioning second level controller.

If, for all sets 18, second level controller 14A is in NONE STATE 28 and second level controller 14B is in BOTH STATE 32, it is possible for second level controller 14A to take control of all sets 18 of disk drives. This is desirable if second level controller 14A were repaired and second level controller 14B failed. Second level controller 14A moves from NONE STATE 28 along the preempt b line to BOTH STATE 32. At the same time, second level controller 14B moves from BOTH STATE 32 through RUN-DOWN-BOTH-TO-NONE STATE 42 and into NONE STATE 28. At this point, second level controller 14A controls all disk drives while second level controller 14B controls none of the disk drives.

Various failures may trigger the movement of second level controllers 14 between states. Between states a number of events take place. Each of these events is described in FIGS. 8A-8I. In FIG. 8A, second level controller 14 is in PRIMARY STATE 26. There are three different events which can take place while second level controller 14 is in PRIMARY STATE 26. The first event is for a preempt message 100 to be received from the partner second level controller. At this point, the second level controller receiving such a message will take the secondary path, represented by block 102, and end up at BOTH STATE 32. The second path which may be taken is triggered by receipt of a message 104 from CRS 22 of the other second level controller. This may be some sort of communication which results in the second level controller remaining in PRIMARY STATE 26. It will report and return messages 106 to the other second level controller. The final path which may be taken results in second level controller ending up in RUN-DOWN-PRIMARIES-TO-NONE STATE 30. This path is triggered upon receipt of a message 108 to release both sets of drives or the primary disk drives. A timer is then set in block 110 and upon time out a message 112 is sent to the other second level controller to take control of the primary set of disk drives. Once in RUN-DOWN-PRIMARIES-TO-NONE STATE 30, second level controller 14 will eventually end up in NONE STATE 28.

FIG. 8B illustrates various paths from RUN-DOWN-PRIMARIES-TO-NONE STATE 30 to NONE STATE 28. Three possible events may take place. First, a message 114 may be received from another second level controller providing communication information. In this case, second level controller 14 reports back messages 116 and remains in RUN-DOWN-PRIMARIES-TO-NONE STATE 30. The second event which may occur is for the timer, set during transition from PRIMARY STATE 26 to RUN-DOWN-PRIMARIES-TO-NONE STATE 30, to time out 118. If this happens, second level controller 14 realizes that message 112 (FIG. 8A) didn't get properly sent and that there has been a complete failure. It releases control of both its primaries and secondary disk drives 122. It then ends up in NONE STATE 28. The third event which may occur while in RUN-DOWN-PRIMARIES-TO-NONE STATE 30 is for a response to be received 124 from message 112 (FIG. 8A) sent out while second level controller moved from PRIMARY STATE 26 to RUN-DOWN-PRIMARIES-TO-NONE STATE 30. This response indicates that the message was properly received. Second level controller 14 then releases its primary drives 126 and ends up in NONE STATE 28.

FIG. 8C covers the state transition between NONE STATE 28 and one of either BOTH STATE 32, PRIMARY STATE 26, or SECONDARY STATE 36. When in NONE STATE 28, second level controller 14 can only receive messages. First, it may receive a message 128 instructing it to preempt both its primary and alternative sets of disk drives. It performs this function 130 and ends up in BOTH STATE 32. A second possibility is for it to receive a preempt message 132 instructing it to preempt its primary set of drives. It performs this instruction and ends up in PRIMARY STATE 26. A third alternative is the receipt of a preempt message 136 instructing second level controller 14 to preempt its secondary drives. Upon performance of this instruction 138 it ends up in SECONDARY STATE 36. Finally, while in NONE STATE 28 second level controller 14 may receive communication messages 140 from its partner second level controller. It reports back 142 to the other second level controller and remains in NONE STATE 28.

FIG. 8D illustrates the movement of second level controller 14 from SECONDARY STATE 36 to BOTH STATE 32 or RUN-DOWN-SECONDARIES-TO-NONE STATE 40. While in SECONDARY STATE 36, any one of three messages may be received by second level controller 14. A first possibility is for a preempt both or primary message 144 to be received. At this point, second level controller 14 takes control of its primary drives 146 and ends up in BOTH STATE 32. A second possibility is for communication messages 148 to be received from the partner controller. This results in second level controller 14 reporting back 150 and remaining in its present SECONDARY STATE 36. Finally, a release both or secondary message 152 may be received. Second level controller 14 sets a timer 154 upon receipt of this message. It then sends out a message 156 indicating it is now in RUN-DOWN-SECONDARIES-TO-NONE STATE 40.

FIG. 8E shows the transition of second level controller 14 from RUN-DOWN-SECONDARIES-TO-NONE STATE 40 to NONE STATE 28. Three different messages may be received during RUN-DOWN-SECONDARIES-TO-NONE STATE 40. First, messages 158 from the partner second level controller may be received. Second level controller 14 then reports back (160) to its partner and remains in RUN-DOWN-SECONDARIES-TO-NONE STATE 40. A second possibility is for the timer, set between SECONDARY STATE 36 and the present state, to time out (162). This indicates that message 156 (FIG. 8D) was not properly sent out and received by the partner second level controller and that there has been a complete failure to second level controller 14. Second level controller 14 then reports out (164) that it will release both of its sets of disk drives 166. This results in it moving to NONE STATE 28. Finally, second level controller 14 may receive a response 168 to its message 156 (FIG. 8D) sent after setting the timer between SECONDARY STATE 36 and RUN-DOWN-SECONDARIES-TO-NONE STATE 40. Upon receiving this response, it releases its secondary drives and ends up in NONE STATE 28.

FIG. 8F illustrates the various paths from BOTH STATE 32 to any one of RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38, RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 or RUN-DOWN-BOTH-TO-NONE STATE 42. A first possible message which may be received during BOTH STATE 32 is a release primary message 172. This will cause second level controller 14 to set a timer 174, send a message 176 indicating it is running down primaries, and wait in RUN-DOWN-PRIMARIES-TO-SECONDARIES STATE 38. A second message which may be received is a release secondaries message 180. Upon receiving release secondaries message 180, second level controller 14 sets a timer 182 and sends a message 184 indicating it has moved into RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. A third possibility for second level controller 14 is to receive communication messages 186 from its partner second level controller. It will report back (188) and remain in BOTH STATE 32. Finally, second level controller 14 may receive an instruction 190 telling it to release both primary and secondary sets of drives. At this point it sets the timer 192 and sends out a message 194 that it has released both primary and secondary drive sets. It will then remain in the RUN-DOWN-BOTH-TO-NONE STATE 42 until it receives further instructions from the other second level controller.

FIG. 8G shows the various paths by which second 
level controller 14 moves from RUN-DOWN-PRI- 
MARIES-TO-SECONDARIES STATE 38 to one of 
either NONE STATE 28 or SECONDARY STATE 
36. The first possibility is that second level controller 14 
receives messages 196 from the other second level con- 
troller. It then reports back (198) and remains in RUN- 
DOWN-PRIMARIES-TO-SECONDARIES STATE 
38. A second possibility is that the timer (174), set be- 
tween BOTH STATE 32 and RUN-DOWN-PRI- 
MARIES-TO-SECONDARIES STATE 38 times out 
(200). At this point, second level controller 14 realizes 
that message 176 (FIG. 8F) was not properly sent. A 
complete failure has occurred. The second level con- 
troller reports (202) that it has released both sets of disk 
drives, and releases both sets (204). Second level con- 
troller 14 then enters NONE STATE 28. Finally, a run 
down path response message 206 is received acknowl- 
edging receipt of message 176 (FIG. 8F) sent between 
BOTH STATE 32 and RUN-DOWN-PRIMARIES- 
TO-SECONDARIES STATE 38. Second level con- 
troller 14 releases its primary drives 208 and enters 
SECONDARY STATE 36. 

FIG. 8H shows the possible paths down which second level controller 14 moves between RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34 and one of either NONE STATE 28 or PRIMARY STATE 26. A first possibility is that second level controller 14 receives a message 210 from the other second level controller. It then reports back (212) and remains in RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. A second possibility is that the timer (182), set between BOTH STATE 32 and RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34, times out (214). If this occurs, second level controller 14 realizes that message 184 (FIG. 8F) was not properly sent. A complete failure has occurred. Second level controller then sends a message 216 indicating that it has released its drives and then it releases both primary and secondary disk drive sets (218) which it controls. Second level controller then moves into NONE STATE 28. Finally, a third possibility is that second level controller 14 receives a response 220 to message 184 (FIG. 8F) sent between BOTH STATE 32 and RUN-DOWN-SECONDARIES-TO-PRIMARIES STATE 34. It will then release (222) its secondary drives and enter PRIMARY STATE 26.

FIG. 8I shows the possible paths illustrating the transition of second level controller between RUN-DOWN-BOTH-TO-NONE STATE 42 and NONE STATE 28. Three possible events may take place. First, a message 230 may be received from the other second level controller providing communication information. In this case, second level controller 14 reports back messages 232 and remains in RUN-DOWN-BOTH-TO-NONE STATE 42. The second event which may occur is for the timer (192), set during transition from BOTH STATE 32 to RUN-DOWN-BOTH-TO-NONE STATE 42, to time out (234). If this happens, second level controller 14 realizes that message 194 (FIG. 8F) sent during BOTH STATE 32 didn't get properly sent and that there has been a complete failure. It releases control of both its primaries and secondary disk drives (238). It then ends up in NONE STATE 28. The third event which may occur while in RUN-DOWN-BOTH-TO-NONE STATE 42 is for a response to be received (240) from message 194 (FIG. 8F) sent out while second level controller moved from BOTH STATE 32 to RUN-DOWN-BOTH-TO-NONE STATE 42. This response indicates that the message was properly received. Second level controller 14 then releases both sets of drives (242) and ends up in NONE STATE 28.
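The transitions walked through for FIGS. 8A-8I can be collected into a single lookup table. The sketch below is one possible reading of those figures rather than the actual CRS microcode; the enum values reuse the reference numerals of the states, while the event names (`"preempt"`, `"release"`, `"response"`, `"timeout"`, `"comm"` and so on) are hypothetical labels for the messages described above.

```python
from enum import Enum


class State(Enum):
    PRIMARY = 26
    NONE = 28
    RUN_DOWN_PRIMARIES_TO_NONE = 30
    BOTH = 32
    RUN_DOWN_SECONDARIES_TO_PRIMARIES = 34
    SECONDARY = 36
    RUN_DOWN_PRIMARIES_TO_SECONDARIES = 38
    RUN_DOWN_SECONDARIES_TO_NONE = 40
    RUN_DOWN_BOTH_TO_NONE = 42


S = State

# (current state, event) -> next state, following FIGS. 8A-8I.
TRANSITIONS = {
    (S.PRIMARY, "preempt"): S.BOTH,
    (S.PRIMARY, "release"): S.RUN_DOWN_PRIMARIES_TO_NONE,
    (S.NONE, "preempt_both"): S.BOTH,
    (S.NONE, "preempt_primary"): S.PRIMARY,
    (S.NONE, "preempt_secondary"): S.SECONDARY,
    (S.SECONDARY, "preempt"): S.BOTH,
    (S.SECONDARY, "release"): S.RUN_DOWN_SECONDARIES_TO_NONE,
    (S.BOTH, "release_primary"): S.RUN_DOWN_PRIMARIES_TO_SECONDARIES,
    (S.BOTH, "release_secondary"): S.RUN_DOWN_SECONDARIES_TO_PRIMARIES,
    (S.BOTH, "release_both"): S.RUN_DOWN_BOTH_TO_NONE,
    (S.RUN_DOWN_PRIMARIES_TO_NONE, "response"): S.NONE,
    (S.RUN_DOWN_SECONDARIES_TO_NONE, "response"): S.NONE,
    (S.RUN_DOWN_BOTH_TO_NONE, "response"): S.NONE,
    (S.RUN_DOWN_PRIMARIES_TO_SECONDARIES, "response"): S.SECONDARY,
    (S.RUN_DOWN_SECONDARIES_TO_PRIMARIES, "response"): S.PRIMARY,
}

RUN_DOWN_STATES = {s for s in State if s.name.startswith("RUN_DOWN")}


def step(state: State, event: str) -> State:
    """Return the next state for one event."""
    if event == "comm":
        # Communication messages are answered without changing state.
        return state
    if event == "timeout" and state in RUN_DOWN_STATES:
        # A timer expiry in any run-down state is treated as a complete
        # failure: both drive sets are released and the state becomes NONE.
        return S.NONE
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not handled in {state.name}") from None


def run(state: State, events) -> State:
    """Apply a sequence of events in order."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(State.BOTH, ["release_secondary", "comm", "response"])` follows the FIG. 8F and FIG. 8H path back to `State.PRIMARY`.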

Rerouting Data Paths Between Buffers and Disk Drives

FIG. 9 illustrates a first preferred embodiment of circuitry for rerouting data paths between buffers and disk drives 20. In FIG. 9, X-bar switches 310 through 315 are coupled to a bus 309 communicating with the second level controller engine (see FIGS. 3 and 4). In turn, each X-bar switch is coupled by a bus to disk drives 20A1 through 20A6 and to each buffer 330 through 336. Bus 350 couples each buffer to a first level controller, which is coupled to a computer such as computer 10 (FIGS. 3 and 4). In this embodiment, although only six disk drives are illustrated, any arbitrary number could be used, as long as the illustrated architecture is preserved by increasing the number of X-bar switches and output buffers in a like manner and maintaining the interconnected bus structures illustrated in FIG. 9.

In operation, the second level controller will load various registers (not illustrated herein) which configure the X-bar switches to communicate with particular buffers and particular disk drives. The particular configuration can be changed at any time while the system is operating. Data flow is bi-directional over all the buses. By configuring the X-bar switches, data flowing from any given buffer may be sent to any given disk drive or vice versa. Failure of any particular system element does not result in any significant performance degradation, as data flow can be routed around the failed element by reconfiguring the registers for the X-bar switch. In a preferred mode of operation, data may be transferred from or to a particular disk drive in parallel with other data transfers occurring in parallel on every other disk drive. This mode of operation allows for a very high input/output rate of data as well as a high data throughput.

To illustrate this embodiment's mode of operation, 
the following example is offered. Referring to FIG. 9, 
assume that all data flow is initially direct, meaning, for 
example, that data in buffer 330 flows directly through 
X-bar switch 310 to disk drive 20A1. Were buffer 330 to
fail, the registers of X-bar switch 310 could be recon- 
figured, enabling X-bar switch 310 to read data from 
buffer 335 and direct that data to disk drive 20A1. Simi- 
lar failures in other buffers and in the disk drives could 
be compensated for in the same manner. 
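The rerouting example above can be modelled as a register per X-bar switch that selects which buffer the switch reads from. The sketch below is only an illustration under stated assumptions: the class `XBarSwitch`, the helpers `build_direct` and `reroute_failed_buffer`, and the pairing of switch 311 with buffer 331 and drive 20A2 (and so on) are hypothetical; the text confirms only the pairing of buffer 330, X-bar switch 310 and disk drive 20A1.

```python
class XBarSwitch:
    """One X-bar switch, reduced to the register that selects its buffer."""

    def __init__(self, switch_id: int, drive: str, buffer: int):
        self.switch_id = switch_id
        self.drive = drive    # disk drive served by this switch
        self.buffer = buffer  # register value: buffer currently selected

    def configure(self, buffer: int) -> None:
        # Stands in for the second level controller loading the register.
        self.buffer = buffer


def build_direct(count: int = 6):
    """Initial 'direct' data flow: buffer 330+i -> switch 310+i -> drive 20A(i+1)."""
    return [XBarSwitch(310 + i, f"20A{i + 1}", 330 + i) for i in range(count)]


def reroute_failed_buffer(switches, failed: int, replacement: int):
    """Point every switch that reads the failed buffer at the replacement.

    Returns the ids of the switches whose registers were reloaded.
    """
    changed = []
    for switch in switches:
        if switch.buffer == failed:
            switch.configure(replacement)
            changed.append(switch.switch_id)
    return changed


switches = build_direct()
reloaded = reroute_failed_buffer(switches, failed=330, replacement=335)
```

After the call, X-bar switch 310 still serves disk drive 20A1 but reads its data from buffer 335, which mirrors the example in the text.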

Generation of Redundancy Terms and Error Detection 
on Parallel Data 

FIG. 10 illustrates a second preferred embodiment of 
the present invention. This second embodiment incor- 
porates Array Correction Circuits ("ACCs") to provide 
error detection and correction capabilities within the 
same general architecture as illustrated for the first 
preferred embodiment shown in FIG. 9. To ease the 
understanding of this embodiment, the full details of the 
internal structure of both the X-bar switches (310 
through 315) and the ACC circuits 360 and 370 are not 
shown in FIG. 10. FIGS. 11 and 12 illustrate the inter- 
nal structure of these devices and will be referenced and 
discussed in turn. Additionally, bus LBE as illustrated 
in FIG. 10 does not actually couple the second level 
controller (FIGS. 3 and 4) directly to the X-bar 
switches, the ACCs, and the DSI units. Instead, the 
second level controller communicates with various sets 
of registers assigned to the X-bar switches, the ACCs 
and the DSI units. These registers are loaded by the 
second level controller with the configuration data 
which establishes the operating modes of the aforemen- 
tioned components. As such registers are known, and 
their operation incidental to the present invention, they 
are not illustrated or discussed further herein. 

The embodiment shown in FIG. 10 shows data disk 
drives 20A1 through 20A4 and P and Q redundancy 
term drives 20A5 and 20A6. A preferred embodiment 
of the present invention utilizes 13 disk drives: ten for 
data, two for P and Q redundancy terms, and one spare 
or backup drive. It will be understood that the exact 
number of drives, and their exact utilization may vary 
without in any way changing the present invention. 
Each disk drive is coupled by a bi-directional bus (Small
Computer System Interface) to units 340 through 345,
herein labelled DSI. The DSI units perform some error 
detecting functions as well as buffering data flow into 
and out of the disk drives. 
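As a rough illustration of what the P and Q drives hold, the sketch below computes the two redundancy terms for one symbol from each of four data drives, following the P and Q equations and the field generator polynomial X^4 + X + 1 given later in this description. Treating each drive's contribution as a single 4-bit symbol and choosing the per-drive constants as a_i = a^i are assumptions made for this example, as are all function names.

```python
FIELD_POLY = 0b10011  # X^4 + X + 1, the field generator polynomial
ALPHA = 0b0010        # the field element "a" (a root of the polynomial)


def gf_mul(x: int, y: int) -> int:
    """Multiply two 4-bit symbols in GF(16) (shift-and-add with reduction)."""
    result = 0
    while y:
        if y & 1:
            result ^= x          # addition in GF(16) is XOR
        y >>= 1
        x <<= 1
        if x & 0b10000:          # degree reached 4: reduce by the polynomial
            x ^= FIELD_POLY
    return result


def gf_pow(base: int, exponent: int) -> int:
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def pq_terms(data):
    """Return (P, Q) for data[i], the 4-bit symbol read from data drive i.

    P is the XOR of all symbols; Q is the XOR of each symbol multiplied by
    its drive constant, assumed here to be a_i = ALPHA**i.
    """
    p = 0
    q = 0
    for i, symbol in enumerate(data):
        p ^= symbol
        q ^= gf_mul(symbol, gf_pow(ALPHA, i))
    return p, q


p, q = pq_terms([0x1, 0x2, 0x3, 0x4])
```

Recomputing `pq_terms` over data read back from the drives and comparing the result with the stored P and Q values is the basis of the error check; a mismatch in either term signals that at least one symbol has changed.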

Each DSI unit is in turn coupled by a bi-directional 
bus means to an X-bar switch, the X-bar switches herein 
numbered 310 through 315. The X-bar switches are 
coupled in turn to word assemblers 350 through 355 by 
means of a bi-directional bus. The bus width in this 
embodiment is 9 bits, 8 for data, 1 for a parity bit. The 
word assemblers assemble 36-bit (32 data and 4 parity) 
words for transmission to buffers 330 through 335 over 
bi-directional buses having a 36-bit width. When data 
flows from the output buffers to the X-bar switches, the 
word assemblers decompose the 36-bit words into the 
9-bits of data and parity. 
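The word assemblers' two directions can be sketched as a pack and an unpack operation. The bit layout is not given in the text, so the ordering used here (the first 9-bit unit in the most significant position, each unit carried through unchanged with its parity bit) is an assumption, and `assemble` and `decompose` are hypothetical names.

```python
UNIT_BITS = 9             # 8 data bits plus 1 parity bit per bus transfer
UNITS_PER_WORD = 4        # 4 x 9 bits = 36 bits (32 data, 4 parity)
UNIT_MASK = (1 << UNIT_BITS) - 1


def assemble(units):
    """Pack four 9-bit units from an X-bar switch into one 36-bit word."""
    if len(units) != UNITS_PER_WORD:
        raise ValueError("a word is assembled from exactly four 9-bit units")
    word = 0
    for unit in units:
        if not 0 <= unit <= UNIT_MASK:
            raise ValueError("each unit must fit in 9 bits")
        word = (word << UNIT_BITS) | unit
    return word


def decompose(word):
    """Split a 36-bit buffer word back into four 9-bit units."""
    shifts = range(UNIT_BITS * (UNITS_PER_WORD - 1), -1, -UNIT_BITS)
    return [(word >> shift) & UNIT_MASK for shift in shifts]
```

Because the units are carried through unchanged, `decompose(assemble(units))` returns the original four units, parity bits included.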

The X-bar switches are also coupled to ACC units 348 and 349. The interconnection between the X-bar switches and the ACCs is shown in more detail in FIG. 11. Each X-bar switch can send to both or either ACC the 8 bits of data and 1 parity bit that the X-bar switch receives from either the DSI units or the word assemblers. In turn, the X-bar switches can receive 9 bits of the P and Q redundancy terms calculated by the ACCs over lines E1 and E2. As shown, the ACCs can direct the P and Q redundancy terms to any X-bar switch, not being limited to the disk drives labelled P and Q. Depending on the configuration commanded by the second level controller, ACCs 348 and 349 can be mutually redundant, in which case the failure of one or the other ACC does not affect the system's ability to detect or correct errors, or each ACC can detect and correct errors on a portion of the total set of disk drives. When operating in this second manner, certain specific types of operations which write data to individual disk drives are expedited, as each ACC can write to a separate individual disk drive. The specific disk drives that the individual ACCs monitor can be reconfigured at any time by the second level controller.

The illustrated connections of the ACCs and the X-bar switches also allow data to be switched from any X-bar switch to any ACC once the second level controller configures the related registers. This flexibility allows data to be routed away from any failed disk drive or buffer.

FIG. 11 shows important internal details of the ACCs and the X-bar switches. X-bar switch 310 is composed of two mirror-image sections. These sections comprise, respectively, 9-bit tri-state registers 370/380, multiplexers 372/382, first 9-bit registers 374/384, second 9-bit registers 376/386, and input/output interfaces 379/389. In operation, data can flow either from the word assembler to the DSI unit or vice versa.

Although many pathways through the X-bar switch are possible, as shown by FIG. 11, two aspects of these pathways are of particular importance. First, in order to allow the ACC sufficient time to calculate P and Q redundancy terms or to detect and correct errors, a data pathway of several registers can be used, the data requiring one clock cycle to move from one register to the next. By clocking the data through several registers, a delay of sufficient length can be achieved. For example, assuming a data flow from the word assembler unit to a disk drive, 9 bits are clocked into 9-bit register 374 and tri-state register 370 on the first clock pulse. On the next clock pulse, the data moves to 9-bit register 386 and through redundancy circuit 302 in the ACC 348 to P/Q registers 304 and 306. The next clock pulses move the data to the DSI unit.

The second important aspect of the internal pathways relates to the two tri-state registers. The tri-state registers are not allowed to be active simultaneously. In other words, if either tri-state register 370 or 380 is enabled, its counterpart is disabled. This controls data transmission from the X-bar switch to the ACC. The data may flow only from the DSI unit to the ACC or from the word assembler to the ACC, but not from both to the ACC simultaneously. In the opposite direction, data may flow from the ACC to the word assembler and the DSI simultaneously.

ACC unit 348 comprises a redundancy circuit 302, wherein P and Q redundancy terms are generated, P and Q registers 304 and 306, wherein the P and Q redundancy terms are stored temporarily, regenerator and corrector circuit 308, wherein the data from or to a failed disk drive or buffer can be regenerated or cor-

Redundancy Generation and Error Checking Equations

The main functional components of the second preferred embodiment and their physical connections to one another have now been described. The various preferred modes of operation will now be described. In order to understand these functional modes, some understanding of the error detection and correction method used by the present invention will be necessary.

Various error detection and correction codes are known and used in the computer industry. Error-Control Coding and Applications, D. Wiggert, The MITRE Corp., describes various such codes and their calculation. The present invention in this second preferred embodiment is implemented using a Reed-Solomon error detection and correction code. Nothing herein should be taken to limit the present invention to using only a Reed-Solomon code. If other codes were used, various modifications to the ACCs would be necessary, but these modifications would in no way change the essential features of this invention.

Reed-Solomon codes are generated by means of a field generator polynomial, the one used in this embodiment being X^4 + X + 1. The code generator polynomial needed for this Reed-Solomon code is (X + a^0)(X + a^1) = X^2 + a^4 X + a^1. The generation and use of these codes to detect and correct errors is known.

The actual implementation of the Reed-Solomon code in the present invention requires the generation of various terms and syndromes. For purposes of clarity, these terms are generally referred to herein as the P and Q redundancy terms. The equations which generate the P and Q redundancy terms are:

    P = d_{n-1} + d_{n-2} + \cdots + d_1 + d_0

and

    Q = d_{n-1} \cdot a_{n-1} + d_{n-2} \cdot a_{n-2} + \cdots + d_1 \cdot a_1 + d_0 \cdot a_0

The P redundancy term is essentially the simple parity of all the data bytes enabled in the given calculation. The Q logic calculates the Q redundancy for all data bytes that are enabled. For Q redundancy, input data must first be multiplied by a constant "a" before it is summed. The logic operations necessary to produce the P and Q redundancy terms are shown in FIGS. 12a and 12b. All operations denoted by ⊕ are exclusive-OR ("XOR") operations. Essentially, the final P term is the sum of all P_i terms. The Q term is derived by multiplying all Q_i terms by a constant and then XORing the results. These calculations occur in redundancy circuit 302 in ACC 260 (FIG. 11). The second preferred embodiment, using its implementation of the Reed-Solomon code, is able to correct the data on up to two failed disk drives.

The correction of data requires the generation of additional terms S_0 and S_1 within the ACC. Assuming that the P and Q redundancy terms have already been calculated for a group of data bytes, the syndrome equations

    S_0 = d_{n-1} + d_{n-2} + \cdots + d_1 + d_0 + P

    S_1 = (d_{n-1} \cdot a_{n-1}) + (d_{n-2} \cdot a_{n-2}) + \cdots + (d_1 \cdot a_1) + (d_0 \cdot a_0) + Q
rected, and output to interfaces 390, 391, 392 and 393. 
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are used to calculate So and Si. For So an ACC register Although FIG. 10 only shows four data drives and 
enables the necessary data bytes and the P redundancy the P and Q redundancy term drives, a preferred em- 
to be used in the calculation. For Si, the necessary input bodiment uses a set of 13 disk drives, 10 for data 2 for 
data must first be multiplied by a/ before being summed the P and Q terms, and a spare. Although nothing 
with the Q redundancy information. 5 herein should be construed to limit this discussion to 

As stated, an ACC can correct the data on up to two that specific embodiment, parallel processing opera- 
failed disk dnves in this embodiment. The failed disk tions will be described with relation to that environ- 
dnve register (not illustrated) in the relevant ACC will ment. 
be loaded with the address of the failed disk or disks by 

the second level controller. A constant circuit within 10 Parallel Processing Operations 

the ACC will use the drive location information to In parallel processing operations, all the drives are 
calculate two constants k 0 and kj as indicated in Table 1 considered to comprise a single large set. Each of the 
below, where i represents the address of the first failed disk drives will either receive or transmit 9 bits of data 
disk drive, ) is the address of the second failed disk simultaneously. The result of this is that the 9-bits of 
drive, and a is a constant. The columns labelled Failed 15 data appearing in the DSI units of all the drives simulta- 
Dnves indicate which dnves have failed. Column k 0 neously are treated as one large codeword. This result is 
and k| i indicate how those constants are calculated given shown in FIG. 13a Codeword 400 comprises 9 bits of 
the failure of the dnves noted in the Failed Drives col- data from or for disk drive d„_ u 9 bits of data from or 
umns * for disk drive d„_ 2 , and so on, with the P and Q disk 

TABLE 1 20 driv « receiving or transmitting the P and Q redun- 

dancy term. In a parallel write operation, all the disk 
drives in the set, except for the spare disk drive, will 
receive a byte of data (or a redundancy term whose 
length is equal to the data byte) simultaneously. As 
25 shown, the same sector in all the disk drives will receive 
a part of codeword 400. For example, in the illustration, 
sector 1 of disk drive n-1 will receive a byte of data 
designated d„-i from codeword 400, sector 1 of disk 
» _ drivc n -2 will receive a byte of data designated d„_2 

The error correction circuits use the syndrome informa- 30 from codeword 400 and so on. 
tion So and Si, as well as the two constants koand ki to In the actual implementation of this preferred em- 
generate the data contained on the failed disk drives. bodiment, the codewords are "striped" across the vari- 
The error correction equations are as follows: ous disk drives. This means that for-each successive 

f -Snk +s l- codeword, different disk drives receive the P and Q 

1 0 11 35 redundancy terms. In other words, drive d„_ i is treated 

^=5 0 +£| as drive dfl - 2 for the second codeword and so on, until 

w «at was originally drive d ff _i receives a Q redun- 

Fi is the replacement data for the first failed disk drive. l ancy J en71 * .T* 1 " 5 ! the redundancy terms "stripe" 
F 2 is the replacement data for the second failed disk ^ thrOUgh the disk dnves * 
drive. The equations which generate the P and Q redun- Pairs of P and Q Terms for Nibbles 

dancy terms are realized in combinatorial logic, as is 0 , , . , ~ 

partially shown in FIGS. 12a and 126. This has the Calculating the P and Q redundancy terms using 8-bit 
advantage of allowing the redundancy terms to be gen- svmb P ls would require a great deal of hardware. To 
erated and written to the disk drives at the same time AC l? Ce th J s hardwar e overhead, the calculations are 
that the data is written to the drives. This mode of P er \ ormed usln £ *-bit bytes or nibbles. This hardware 
operation will be discussed later. implementation does not change the invention concep- 

tually, but does result in the disk drives receiving two 
Operational Modes 4-bit data nibbles combined to make one 8-bit byte. In 

Having described the aspects of the Reed-Solomon <A f IG ' A 13 ,' t 450 '. ™ wc " as the iIlus ™ed sec- 
code implementation necessary to understand the pres- 5 0rs . A of the T* dnves ' lllust r*te how the codeword is 
ent invention, the operational modes of the present f " ^ ow * c disk d "ves receive upper and 

invention will now be discussed. lower 4 * blt mbbles - TabJ e 2 shows how, for codewords 

The second preferred embodiment of the present °? e t] ] TOU &\ N ' » different portion of the codeword is 
invention operates primarily in one of two classes of « P ' aCCd °V different dnves. Each disk drive, for a 
operations. These are parallel data storage operations 55 8 !*5? c °*™ori f receives an upper and lower 4-bit 
and transaction processing operation. These two classes ?v!. c : de ?|S na * cd Wlth L ' s and u ' s - of the codeword, 
of operations will now be discussed with reference to Additionally, the same section is used to store the nib- 
the figures, particularly FIGS. 10, 13 and 14 and Tables ? n T of thc disk drivcs used t0 store th e code- 

2 through 7. word * In other words, for codeword j , the first sector of 

60 disk dnves n-1 through 0 receives the nibbles. 

TABLE 2

CODEWORD - DATA AND P AND Q

             Sector of                   Sector of                   Sector of               Sector of             Sector of
             Drive $d_{n-1}$             Drive $d_{n-2}$             Drive $d_0$             Drive P               Drive Q

Codeword 1   $(d_{n-1,L})(d_{n-1,U})$    $(d_{n-2,L})(d_{n-2,U})$    $(d_{0,L})(d_{0,U})$    $(P_{1L})(P_{1U})$    $(Q_{1L})(Q_{1U})$

Codeword 2   $(d_{n-1,L})(d_{n-1,U})$    $(d_{n-2,L})(d_{n-2,U})$    $(d_{0,L})(d_{0,U})$    $(P_{2L})(P_{2U})$    $(Q_{2L})(Q_{2U})$






5,274,645 


TABLE 2-continued

CODEWORD - DATA AND P AND Q

             Sector of                   Sector of                   Sector of               Sector of             Sector of
             Drive $d_{n-1}$             Drive $d_{n-2}$             Drive $d_0$             Drive P               Drive Q

Codeword n   $(d_{n-1,L})(d_{n-1,U})$    $(d_{n-2,L})(d_{n-2,U})$    $(d_{0,L})(d_{0,U})$    $(P_{nL})(P_{nU})$    $(Q_{nL})(Q_{nU})$
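The nibble-wise P and Q calculation can be illustrated with a minimal Python sketch. It assumes that the coefficient for data drive i is the i-th power of the field element 2 in GF(16) built on the field generator polynomial X^4 + X + 1; the text does not state how the coefficients are chosen, so that choice and all function names here are hypothetical.

```python
GF_POLY = 0b10011  # field generator polynomial X^4 + X + 1


def gf16_mul(a, b):
    """Multiply two 4-bit symbols in GF(16) modulo X^4 + X + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a          # addition in GF(16) is XOR
        b >>= 1
        a <<= 1
        if a & 0x10:             # reduce when the degree reaches 4
            a ^= GF_POLY
    return result


def coefficient(i):
    """Assumed coefficient for data drive i: the i-th power of the element 2."""
    c = 1
    for _ in range(i):
        c = gf16_mul(c, 2)
    return c


def pq_terms(nibbles):
    """Return (P, Q) for one column of 4-bit symbols, one per data drive."""
    p = 0
    q = 0
    for i, d in enumerate(nibbles):
        p ^= d                               # P is the simple parity
        q ^= gf16_mul(d, coefficient(i))     # Q weights each symbol first
    return p, q


def encode_byte_column(data_bytes):
    """Split each drive's byte into nibbles and build the P and Q bytes."""
    lower = [b & 0x0F for b in data_bytes]
    upper = [b >> 4 for b in data_bytes]
    p_lo, q_lo = pq_terms(lower)
    p_hi, q_hi = pq_terms(upper)
    return (p_hi << 4) | p_lo, (q_hi << 4) | q_lo
```

Because the lower and upper nibbles are encoded independently, each stored byte carries two separate 4-bit codeword symbols, which matches the L and U pairs of Table 2.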



Referring back to FIG. 10, for a parallel data write to the disks, the data is provided in parallel from buffers 330, 331, 332 and 333 along those data buses coupling the buffers to X-bar switches 310, 311, 312, and 313 after the 36 bits of data are disassembled in word assemblers 350 through 353 into 9-bit words. These X-bar switches are also coupled to inputs D3, D2, D1 and D0, respectively, of ACC 348 and ACC 349. In parallel processing modes, the two ACCs act as mutual "backups" to one another. Should one fail, the other will still perform the necessary error correcting functions. In addition to operating in a purely "backup" condition, the second level controller engine configures the ACCs so that each ACC is performing the error detection and correction functions for a portion of the set, the other ACC performing these functions for the remaining disk drives in the set. As the ACC units are still coupled to all the disk drives, failure of one or the other unit does not impact the system as the operating ACC can be reconfigured to act as the dedicated ACC unit for the entire set. For purposes of discussion, it is assumed here that ACC 348 is operating. ACC 348 will calculate the P and Q redundancy term for the data in the X-bar switches and provide the terms to its E1 and E2 outputs, which outputs are coupled to all the X-bar switches. For discussion only, it is assumed that only the E2 connection of X-bar switch 314 and the E1 connection of X-bar switch 315 are enabled. Thus, although the data is provided along the buses coupling ACC 348's E1 and E2 output to all the X-bar switches, the Q term is received only by X-bar switch 314 and the P term is received by X-bar switch 315. Then, the Q and P terms are provided first to DSI units 344 and 345 and then disk drives 20A5 and 20A6. It should be recalled that the various internal registers in the X-bar switches will act as a multi-stage pipeline, effectively slowing the transit of data through the switches sufficiently to allow ACC 348's redundancy circuit 302 to calculate the P and Q redundancy terms.

As ACC 349 is coupled to the X-bar switches in a 
substantially identical manner to ACC 348, the opera- 
tion of the system when ACC 349 is operational is es- 
sentially identical to that described for ACC 348. 

Subsequent parallel reads from the disks occur in the following manner. Data is provided on bi-directional buses to DSI units 340, 341, 342 and 343. P and Q redundancy terms are provided by DSI units 345 and 344, respectively. As the data and P and Q terms are being transferred through X-bar switches 310 through 315, ACC 348 uses the P and Q terms to determine if the data being received from the disk drives is correct. Word assemblers 350 through 353 assemble successive 9-bit words until the next 36 bits are available. These 36 bits are forwarded to buffers 330 through 333. Note that the 9-bit words are transmitted to the buffers in parallel. If that data is incorrect, the second level controller will be informed.
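The check performed during a parallel read can be sketched as a syndrome test: when the data read back agrees with the stored P and Q terms, both syndromes are zero. This is a minimal Python illustration for one column of 4-bit symbols, assuming the coefficient for drive i is the i-th power of the field element 2 in GF(16); that choice and the function names are hypothetical and are not taken from the text.

```python
def gf16_mul(a, b):
    """Multiply two 4-bit symbols in GF(16) modulo X^4 + X + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
    return result


def coefficient(i):
    """Assumed coefficient for data drive i: the i-th power of the element 2."""
    c = 1
    for _ in range(i):
        c = gf16_mul(c, 2)
    return c


def codeword_is_consistent(nibbles, p, q):
    """Return True when both syndromes of the received codeword are zero."""
    s0 = p
    s1 = q
    for i, d in enumerate(nibbles):
        s0 ^= d
        s1 ^= gf16_mul(d, coefficient(i))
    return s0 == 0 and s1 == 0
```

A False result corresponds to the case in which the second level controller is informed that the data is incorrect.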



During a parallel read operation, in the event that there is a failure of a disk drive, the failed disk drive will, in certain instances, communicate to the second level controller that it has failed. The disk drive will communicate with the second level controller if the disk drive cannot correct the error using its own corrector. The second level controller will then communicate with ACCs 348 and 349 by loading the failed drive registers in the ACC (not shown in the figures) with the address of the failed drive. The failed drive can be removed from the set by deleting its address from the configuration registers. One of the set's spare drives can then be used in place of the failed drive by inserting the address of the spare drive into the configuration registers.

The ACC will then calculate the replacement data necessary to rewrite all the information that was on the failed disk onto the newly activated spare. In this invention, the term spare or backup drive indicates a disk drive which ordinarily does not receive or transmit data until another disk drive in the system has failed.

When the data, P, and Q bytes are received, the ACC circuits use the failed drive location in the failed drive registers to direct the calculation of the replacement data for the failed drive. After the calculation is complete, the data bytes, including the recovered data, are sent to data buffers in parallel. Up to two failed drives can be tolerated with the Reed-Solomon code implemented herein. All operations to replace failed disk drives and the data thereon occur when the system is operating in a parallel mode.

Regeneration of data occurs under second level controller control. When a failed disk drive is to be replaced, the ACC regenerates all the data for the replacement disk. Read/write operations are required until all the data has been replaced. The regeneration of the disk takes a substantial amount of time, as the process occurs in the background of the system's operations so as to reduce the impact to normal data transfer functions. Table 3 below shows the actions taken for regeneration reads. In Table 3, i represents a first failed drive and j represents a second failed drive. In Table 3, the column labelled Failed Drives indicates the particular drives that have failed. The last column describes the task of the ACC given the particular indicated failure.

TABLE 3

Failed Drives    Regeneration Read

P    -           ACC calculates P redundancy
Q    -           ACC calculates Q redundancy
i    -           ACC calculates replacement data for i drive
i    P           ACC calculates replacement data for i drive and P redundancy
Q    i           ACC calculates replacement data for i drive and Q redundancy
j    i           ACC calculates replacement data for i and j drives
P    Q           ACC calculates P and Q redundancy
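The case in which two data drives i and j have failed can be sketched in Python using the syndromes and the two constants. This is only a sketch under stated assumptions: the coefficient for drive i is taken to be the i-th power of the field element 2 in GF(16), and the constants are derived algebraically from the syndrome equations as k0 = a_j / (a_i + a_j) and k1 = 1 / (a_i + a_j) rather than copied from Table 1. The function names are hypothetical, and the cases that involve a failed P or Q drive are not covered.

```python
def gf16_mul(a, b):
    """Multiply two 4-bit symbols in GF(16) modulo X^4 + X + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
    return result


def gf16_inv(a):
    """Multiplicative inverse in GF(16), found by exhaustive search."""
    for b in range(1, 16):
        if gf16_mul(a, b) == 1:
            return b
    raise ValueError("zero has no inverse")


def coefficient(i):
    """Assumed coefficient for data drive i: the i-th power of the element 2."""
    c = 1
    for _ in range(i):
        c = gf16_mul(c, 2)
    return c


def pq_terms(nibbles):
    """Return (P, Q) for one column of 4-bit symbols."""
    p = q = 0
    for i, d in enumerate(nibbles):
        p ^= d
        q ^= gf16_mul(d, coefficient(i))
    return p, q


def recover_two(nibbles, p, q, i, j):
    """Rebuild the symbols of failed data drives i and j.

    Entries of `nibbles` at positions i and j are ignored, which models
    drives whose contents cannot be read.
    """
    s0, s1 = p, q
    for n, d in enumerate(nibbles):
        if n in (i, j):
            continue                     # failed drives contribute nothing
        s0 ^= d
        s1 ^= gf16_mul(d, coefficient(n))
    a_i, a_j = coefficient(i), coefficient(j)
    k1 = gf16_inv(a_i ^ a_j)             # assumed form of the constants
    k0 = gf16_mul(a_j, k1)
    f1 = gf16_mul(s0, k0) ^ gf16_mul(s1, k1)
    f2 = s0 ^ f1
    return f1, f2
```

With the failed positions excluded, S0 reduces to d_i + d_j and S1 to d_i·a_i + d_j·a_j, so the first replacement value comes from solving that pair of equations and the second follows from S0 alone.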



It should be noted that if both a data disk drive and a redundancy disk drive fail, the data on the data disk drive must be regenerated before the redundancy terms on the redundancy drive. During a regeneration write, regeneration data or redundancy terms are written to a disk and no action is required from the ACC logic.

During a parallel read operation, it should also be noted that additional error detection may be provided by the ACC circuitry.

Table 4 indicates what actions may be taken by the ACC logic unit when the indicated drive(s) has or have failed during a failed drive read operation. In this operation, the drives indicated in the Failed Drives columns are known to have failed prior to the read operation. The last column indicates the ACC response to the given failure.

TABLE 4

Failed Drives

P    -           No action by ACC
Q    -           No action by ACC
i    -           ACC calculates replacement data
i    P           ACC calculates the replacement data
Q    i           ACC calculates the replacement data
i    j           ACC calculates replacement data
P    Q           No action by ACC

Transaction Processing Mode: Read

Transaction processing applications require the ability to access each disk drive independently. Although each disk drive is independent, the ACC codeword with P and Q redundancy is maintained across the set in the previously described manner. For a normal read operation, the ACC circuitry is not generally needed. If only a single drive is read, the ACC cannot do its calculations since it needs the data from the other drives to assemble the entire codeword to recalculate P and Q and compare it to the stored P and Q. Thus, the data is assumed to be valid and is read without using the ACC circuitry (see FIG. 15). Where drive 20C1 is the one selected, the data is simply passed through DSI unit 342, X-bar switch 312, word assembler 352 and buffer 332 to the external computer. If the disk drive has failed, the read operation is the same as a failed drive read in parallel mode with the exception that only the replacement data generated by the ACC is sent to the data buffer. In this case, the disk drive must notify the second level controller that it has failed, or the second level controller must otherwise detect the failure. Otherwise, the second level controller will not know that it should read all the drives, unless it assumes that there might be an error in the data read from the desired drive. The failed drive read is illustrated in FIG. 16, with drive 20C1 having the desired data, as in the example of FIG. 15. In FIG. 16, the second level controller knows that drive 20C1 has failed, so the second level controller calls for a read of all drives except the failed drive, with the drive 20C1 data being reconstructed from the data on the other drives and the P and Q terms. Only the reconstructed data is provided to its buffer, buffer 332, since this is the only data the external computer needs.

Transaction Processing Mode: Write

When any individual drive is written to, the P and Q redundancy terms must also be changed to reflect the new data (see FIG. 18). This is because the data being written over was part of a code word extending over multiple disk drives and having P and Q terms on two disk drives. The previously stored P and Q terms will no longer be valid when part of the codeword is changed, so new P and Q terms, P" and Q", must be calculated and written over the old P and Q terms on their respective disk drives. P" and Q" will then be proper redundancy terms for the modified code word.

One possible way to calculate P" and Q" is to read out the whole codeword and store it in the buffers. The new portion of the codeword for drive 20C1 can then be supplied to the ACC circuit along with the rest of the codeword, and the new P" and Q" can be calculated and stored on their disk drives as for a normal parallel write. However, if this method is used, it is not possible to simultaneously do another transaction mode access of a separate disk drive (i.e., drive 20A1) having part of the codeword, since that drive (20A1) and its buffer are needed for the transaction mode write for the first drive (20C1).

According to a method of the present invention, two simultaneous transaction mode accesses are made possible by using only the old data to be written over and the old P and Q to calculate the new P" and Q" for the new data. This is done by calculating an intermediate P' and Q' from the old data and old P and Q, and then using P' and Q' with the new data to calculate the new P" and Q". This requires a read-modify-write operation on the P and Q drives. The equations for the new P and Q redundancy are:

New P redundancy (P") = (old P - old data) + new data

New Q redundancy (Q") = (old Q - old data·$a_i$) + new data·$a_i$

P' = old P - old data

Q' = old Q - old data·$a_i$

where $a_i$ is the coefficient from the syndrome equation $S_1$, and i is the index of the drive.

During the read portion of the read-modify-write, the data from the drive to be written to and the P and Q drives are summed by the ACC logic, as illustrated in FIG. 17. This summing operation produces the P' and Q' data. The prime data is sent to a data buffer. When the new data is in a data buffer, the write portion of the cycle begins as illustrated in FIG. 18. During this portion of the cycle, the new data and the P' and Q' data are summed by the ACC logic to generate the new P" and Q" redundancy. When the summing operation is complete, the new data is sent to the disk drive and the redundancy information is sent to the P and Q drives.

Parity Check of P and Q for Transaction Mode Write

During these read-modify-write operations, it is also possible that the ACC unit itself may fail. In this case, if the data in a single element were to be changed by a read-modify-write operation, a hardware failure in the
ACC might result in the redundancy bytes for the new data being calculated erroneously. To prevent this occurrence, the parity detector and parity generator are made part of the ACC circuitry. This additional redundant circuit is shown in FIGS. 14a and 14b and resides within redundancy circuit 302 as shown in FIG. 11. When data is received by the ACC circuitry, parity is checked to ensure that no errors have occurred using the P and Q redundancy terms. In calculating Q", new parity is generated for the product of the multiply operation and is summed with the parity of the old Q" term. This creates the parity for the new Q term. For the P byte, the parity bits from the data are summed with the parity bit of the old P term to create the new parity bit for the new P" term. Before writing the new data back to the disk drive, the parity of Q' (calculated as indicated previously) is checked. Should Q' be incorrect, the second level controller engine will be informed of an ACC failure. In this manner, a failure in the ACC can be detected.
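The read-modify-write update of the redundancy terms described for transaction mode writes can be sketched in Python for a single 4-bit symbol. In GF(16) subtraction and addition are both XOR, so the intermediate and final terms are formed with XOR operations. The coefficient for drive i is assumed here to be the i-th power of the field element 2, and the function names are hypothetical; the parity checking of the ACC itself is not modelled.

```python
def gf16_mul(a, b):
    """Multiply two 4-bit symbols in GF(16) modulo X^4 + X + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
    return result


def coefficient(i):
    """Assumed coefficient for data drive i: the i-th power of the element 2."""
    c = 1
    for _ in range(i):
        c = gf16_mul(c, 2)
    return c


def pq_terms(nibbles):
    """Full recalculation of (P, Q), used here only as a reference."""
    p = q = 0
    for i, d in enumerate(nibbles):
        p ^= d
        q ^= gf16_mul(d, coefficient(i))
    return p, q


def rmw_update(old_p, old_q, old_data, new_data, drive_index):
    """Return the new (P, Q) after one drive's symbol is overwritten."""
    a_i = coefficient(drive_index)
    # Read portion: remove the old data's contribution (intermediate terms).
    p_prime = old_p ^ old_data
    q_prime = old_q ^ gf16_mul(old_data, a_i)
    # Write portion: add the new data's contribution.
    new_p = p_prime ^ new_data
    new_q = q_prime ^ gf16_mul(new_data, a_i)
    return new_p, new_q
```

Only the target drive and the P and Q drives are touched, so a second transaction on a different data drive of the same codeword does not need the first drive or its buffer.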

The same operations are performed for a failed disk 
drive write in transaction processing operations as for 
parallel data writes, except that data is not written to a 
failed drive or drives. 

With respect to transaction processing functions during normal read operations, no action is required from the ACC logic. The actions taken by the ACC logic during a failed drive read in transaction processing mode are listed in Table 5 below, where i and j represent the first and second failed drives. The columns labelled Failed Drives indicate which drives have failed. The last column indicates what action the ACC may or may not take in response to the indicated failure.

TABLE 5

Failed Drives

P    -           Redundancy drives are not read; no ACC action
Q    -           Redundancy drives are not read; no ACC action
i    -           ACC logic calculates replacement data and performs a parallel read
i    P           ACC logic calculates replacement data and performs a parallel read
Q    i           ACC logic calculates replacement data and performs a parallel read
i    j           ACC logic calculates replacement data and performs a parallel read
P    Q           No ACC action as only data disk drives are read



If two data disk drives fail, the ACC logic must calculate the needed replacement data for both disk drives. If only one failed drive is to be read, both failed drives must still be noted by the ACC logic.

In the read-before-write operation (part of the read-modify-write process), the ACC logic generates P' and Q' redundancy terms. Table 6 shows the action taken by the ACC logic when a failed disk drive read precedes a write in this process. Again, i and j represent the first and second failed drives. The columns headed by Failed Drives indicate which drives have failed, and the last column denotes the response of the ACC to the indicated failures.

TABLE 6

Failed Drives

P    -               ACC calculates Q' only
Q    -               ACC calculates P' only
i    -               ACC logic takes no action and all good data disk drives are read into data buffers
i    P               All good data disk drives are read into data buffers
Q    i               All good data disk drives are read into data buffers
i    j               All good data disk drives are read into data buffers
i    failed drive    Perform a parallel read, the ACC logic calculates the replacement data for the jth failed drive. Next, the remaining good data disk drives are read into the data buffers.
P    Q               No read before write operation is necessary

When a failed data disk drive is to be written, all good data disk drives must be read so that a new P and Q redundancy can be generated. All of the data from the good data disk drives and the write data is summed to generate the new redundancy. When two data disk drives fail, the ACC logic must calculate replacement data for both failed drives. If only one drive is to be read, both must be reported to the ACC logic.

During write operations, the ACC continues to calculate P and Q redundancy. Table 7 shows the ACC's tasks during failed drive writes. Here P and Q represent the P and Q redundancy term disk drives, and i and j represent the first and second failed data disk drives. The columns Failed Drives denote the particular failed drives, and the last column indicates the ACC response to the failed drives.

TABLE 7

Failed Drives

P    -           ACC calculates Q redundancy only
Q    -           ACC calculates P redundancy only
i    -           ACC calculates P and Q redundancy
i    P           ACC calculates Q redundancy only
Q    i           ACC calculates P redundancy only
i    j           ACC calculates P and Q redundancy
P    Q           ACC logic takes no action
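The ACC responses listed for failed drive writes appear to follow a single rule: a redundancy term is calculated only if the drive that stores it has not failed. The short Python sketch below states that rule, assuming the rows of Table 7 follow the same failed-drive ordering as Table 4; the function name and the string labels are hypothetical.

```python
def redundancy_terms_to_calculate(failed_drives):
    """Return the redundancy terms the ACC still calculates during a write.

    `failed_drives` is a set of labels such as {"P"}, {"Q", "i"} or
    {"i", "j"}, where "P" and "Q" name the redundancy drives and any
    other label names a failed data drive.
    """
    return [term for term in ("P", "Q") if term not in failed_drives]
```

An empty result corresponds to the row in which both redundancy drives have failed and the ACC logic takes no action.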



Summary of ECC 

The interconnected arrangements herein described relative to both preferred embodiments of the present invention allow for the simultaneous transmission of data from all disks to the word assemblers or vice versa. Data from or to any given disk drive may be routed to any other word assembler through the X-bar switches under second level controller engine control. Additionally, data in any word assembler may be routed to any disk drive through the X-bar switches. The ACC units receive all data from all X-bar switches simultaneously. Any given disk drive, if it fails, can be removed from the network at any time. The X-bar switches provide alternative pathways to route data or P and Q terms around the failed component.

The parallel arrangement of disk drives and X-bar switches creates an extremely fault-tolerant system. In the prior art, a single bus feeds the data from several disk drives into a single large buffer. In the present invention, the buffers are small and one buffer is assigned to each disk drive. The X-bar switches, under control of the ACC units, can route data from any given disk drive to any given buffer and vice versa. Each second level controller has several spare disks and one spare buffer coupled to it. The failure of any two disks can be easily accommodated by switching the failed disk from the configuration by means of its X-bar switch and switching one of the spare disks onto the network. The present invention thus uses the error detection and correction capabilities of a Reed-Solomon error correction code in an operational environment where the system's full operational capabilities can be maintained by reconfiguring the system to cope with any detected disk or buffer failure. The ACC can correct and regenerate the data for the failed disk drive and, by reconfiguring the registers of the failed and spare disk drives, effectively remove the failed drive from the system and regenerate or reconstruct the data from the failed disk onto the spare disk.

Disk Drive Configuration and Format

The present invention allows a set of physical mass data storage devices to be dynamically configured as one or more logical mass storage devices. In accordance with the present invention, such a set of physical devices is configurable as one or more redundancy groups and each redundancy group is configurable as one or more data groups.

A redundancy group, as previously used in known device sets, is a group of physical devices all of which share the same redundant device set. A redundant device is a device that stores duplicated data or check data for purposes of recovering stored data if one or more of the physical devices of the group fails.

Where check data is involved, the designation of a particular physical device as a redundant device for an entire redundancy group requires that the redundant device be accessed for all write operations involving any of the other physical devices in the group. Therefore, all write operations for the group interfere with one another, even for small data accesses that involve less than all of the data storage devices.

It is known to avoid this contention problem on write operations by distributing check data throughout the redundancy group, thus forming a logical redundant device comprising portions of several or all devices of the redundancy group. For example, FIG. 19 shows a group of 13 disk storage devices. The columns represent the various disks D1-D13 and the rows represent different sectors S1-S5 on the disks. Sectors containing check data are shown as hatched. Sector S1 of disk D13 contains check data for sectors of disks D1-D12. Likewise, the remaining hatched sectors contain check data for their respective sector rows. Thus, if data is written to sector S4 of disk D7, then updated check data is written into sector S4 of disk D10. This is accomplished by reading the old check data, re-coding it using the new data, and writing the new check data to the disk. This operation is referred to as a read-modify-write. Similarly, if data is written to sector S1 of disk D11, then check data is written into sector S1 of disk D13. Since there is no overlap in this selection of four disks for writes, both read-modify-write operations can be performed in parallel.

A distribution of check data in a redundancy group in the manner shown in FIG. 19 is known as a striped check data configuration. The term "striped redundancy group" will be used herein to refer generally to a redundancy group in which check data is arranged in a striped configuration as shown in FIG. 19, and the term "redundancy group stripe depth" will be used herein to refer to the depth of each check data stripe in such a

into various "extents", each defined as a portion of the depth of the redundancy group and each capable of having a configuration of check data different from that of other extents in the same redundancy group. Moreover, it has been found that more than one redundancy group can be provided in a single device set, under the control of a single "array controller", and connected to a main processing unit via one or more device controllers.

Similarly, in previously known device sets, the single redundancy group included only one data group for application data; i.e., the device set operated as a single logical device. It has been found, however, that a redundancy group can be broken up into multiple data groups, each of which can operate as a separate logical storage device or as part of a larger logical storage device. A data group can include all available mass storage memory on a single physical device (i.e., all memory on the device available for storing application data), or it can include all available mass storage memory on a plurality of physical devices in the redundancy group. Alternatively, as explained more fully below, a data group can include several physical devices, but instead of including all available mass storage memory of each device might only include a portion of the available mass storage memory of each device. In addition, it has been found that it is possible to allow data groups from different redundancy groups to form a single logical device. This is accomplished, as will be more fully described, by superimposing an additional logical layer on the redundancy and data groups.

Moreover, in previously known device sets in which application data is interleaved across the devices of the set, the data organization or geometry is of a very simple form. Such sets generally do not permit different logical organizations of application data in the same logical unit nor do they permit dynamic mapping of the logical organization of application data in a logical unit. It has been found that the organization of data within a data group can be dynamically configured in a variety of ways. Of particular importance, it has been found that the data stripe depth of a data group can be made independent of redundancy group stripe depth, and can be varied from one data group to another within a logical unit to provide optimal performance characteristics for applications having different data storage needs.

An embodiment of a mass storage system 500 including two second level controllers 14A and 14B is shown in the block diagram of FIG. 20. As seen in FIG. 20, each of parallel sets 501 and 502 includes thirteen physical drives 503-515 and a second level controller 14. Second level controller 14 includes a microprocessor 516a which controls how data is written and validated across the drives of the parallel set. Microprocessor 516a also controls the update or regeneration of data when one of the physical drives malfunctions or loses synchronization with the other physical drives of the parallel set. In accordance with the present invention, microprocessor 516a in each second level controller 14 also controls the division of parallel sets 501 and 502 into redundancy groups, data groups and application units. The redundancy groups, data groups and application units can be configured initially by the system operator when the parallel set is installed, or they can be
striped redundancy group. 65 configured at any time before use during run-time of the 

In previously known device sets, it was known to parallel set. Configuration can be accomplished, as de- 
provide the whole set as a single redundancy group. It scribed in greater detail below, by defining certain con- 
has been found that a redundancy group can be divided figuration parameters that are used in creating various 
address maps in the program memory of microprocessor
516a and, preferably, on each physical drive of the
parallel set.

Each of second level controllers 14A and 14B is connected
to a pair of first level controllers 12A and 12B.
Each first level controller is in turn connected by a bus
or channel 522 to a CPU main memory. In general, each
parallel set is attached to at least two sets of controllers
so that there are at least two parallel paths from one or
more CPU main memories to that parallel set. Thus, for
example, each of the second level controllers 14A and
14B is connected to first level controllers 12A and 12B
by buses 524 and 526. Such parallel data paths from a
CPU to the parallel set are useful for routing data
around a busy or failed first or second level controller
as described above.

Within each parallel set are an active set 528 comprising
disk drive units 503-514, and a backup set 530 comprising
disk drive unit 515. Second level controller 14
routes data between first level controllers 12 and the
appropriate one or ones of disk drive units 503-515.
First level controllers 12 interface parallel sets 501 and
502 to the main memories of one or more CPUs, and
are responsible for processing I/O requests from applications
being run by those CPUs. A further description
of various components of the apparatus of parallel sets
501 and 502 and first level controllers 12 can be found in
the following co-pending, commonly assigned U.S.
patent applications incorporated herein in their entirety
by reference: Ser. No. 07/487,648 entitled "NON-VOLATILE
MEMORY STORAGE OF WRITE
OPERATION IDENTIFIER IN DATA STORAGE
DEVICE," filed in the names of David T. Powers,
Randy Katz, David H. Jaffe, Joseph S. Glider and
Thomas E. Idleman; and Ser. No. 07/488,750 entitled
"DATA CORRECTIONS APPLICABLE TO REDUNDANT
ARRAYS OF INDEPENDENT
DISKS," filed in the names of David T. Powers, Joseph
S. Glider and Thomas E. Idleman.

To understand how data is spread among the various
physical drives of an active set 528 of a parallel set 501
or 502, it is necessary to understand the geometry of a
single drive. FIG. 21 shows one side of the simplest type
of disk drive, a single platter drive. Some disk drives
have a single disk-shaped "platter" on both sides of
which data can be stored. In more complex drives, there
may be several platters on one "spindle," which is the
central post about which the platters spin.

As shown in FIG. 21, each side 600 of a disk platter
is divided into geometric angles 601, of which eight are
shown in FIG. 21, but of which there could be some
other number. Side 600 is also divided into ring-shaped
"tracks" of substantially equal width, of which seven
are shown in FIG. 21. The intersection of a track and a
geometric angle is known as a sector and typically is the
most basic unit of storage in a disk drive system. There
are fifty-six sectors 603 shown in FIG. 21.

A collection of tracks 602 of equal radius on several
sides 600 of disk platters on a single spindle make up a
"cylinder." Thus, in a single-platter two-sided drive,
there are cylinders of height = 2, the number of cylinders
equalling the number of tracks 602 on a side 600. In
a two-platter drive, then, the cylinder height would be
4. In a one-sided single-platter drive, the cylinder height
is 1.
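
The geometry arithmetic above can be checked with a short sketch. The function names are illustrative assumptions and do not appear in the specification; the figures (seven tracks, eight geometric angles, one or two platters) are those given for FIG. 21 and the cylinder examples.

```python
def sectors_per_side(tracks: int, angles: int) -> int:
    """A sector is the intersection of one track and one geometric angle."""
    return tracks * angles


def cylinder_height(platters: int, sides_per_platter: int) -> int:
    """A cylinder holds one equal-radius track from every recording side."""
    return platters * sides_per_platter


fig21_sectors = sectors_per_side(tracks=7, angles=8)
single_platter_two_sided = cylinder_height(platters=1, sides_per_platter=2)
two_platter_two_sided = cylinder_height(platters=2, sides_per_platter=2)
single_platter_one_sided = cylinder_height(platters=1, sides_per_platter=1)
```

Under these assumptions the FIG. 21 side yields fifty-six sectors and the three drive types yield cylinder heights of 2, 4 and 1, matching the text.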

A disk drive is read and written by "read/write
heads" that move over the surfaces of sides 600. FIG. 22
shows the distribution of data sub-units (sectors, tracks
and cylinders) in a group 716 of eight single-platter
two-sided drives 700-707 in a manner well-suited to
illustrate the present invention. Drives 700-707 may, for
example, correspond to drive units 503-510 of parallel
set 501 or 502. Each of the small horizontal divisions
represents a sector 708. For each drive, four cylinders
709-712 are shown, each cylinder including two tracks
713 and 714, each track including five sectors.

In the preferred embodiment shown in FIG. 22, 
group 716 comprises a single redundancy group in 
which two types of redundancy data, referred to as "P" 
check data and "Q" check data, are used to provide data 
redundancy. The P and Q check data are the results of 
a Reed-Solomon coding algorithm applied to the mass 
storage data stored within the redundancy group. The 
particular method of redundancy used is implementa- 
tion specific. As shown, the redundancy data is distrib- 
uted across all spindles, or physical drives, of group 716, 
thus forming two logical check drives for the redun- 
dancy group comprising group 716. For example, the P 
and Q check data for the data in sectors 708 of cylinders 
709 of drives 700-705 are contained respectively in 
cylinders 709 of drives 706 and 707. Each time data is 
written to any sector 708 in any one of cylinders 709 of 
drives 700-705, a read-modify-write operation is per- 
formed on the P and Q check data contained in corre- 
sponding sectors of drives 706 and 707 to update the 
redundancy data. 
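
The read-modify-write update of check data can be illustrated with a minimal sketch. It assumes simple XOR parity as a stand-in for a single check sector; the Reed-Solomon P and Q codes named above are not modeled, and all function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_check(sectors: list) -> bytes:
    """Recompute the check sector from every data sector (the slow path)."""
    return reduce(xor_bytes, sectors)


def read_modify_write(old_data: bytes, new_data: bytes, old_check: bytes) -> bytes:
    """Update the check sector from the old data, new data and old check only."""
    return xor_bytes(old_check, xor_bytes(old_data, new_data))


# Three data sectors protected by one XOR check sector (assumed values).
data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
check = full_check(data)

# Overwrite the second sector, touching only that sector and the check sector.
new_sector = bytes([10, 11, 12])
updated_check = read_modify_write(data[1], new_sector, check)
data[1] = new_sector
```

The point of the sketch is that the updated check sector equals a full recomputation, although only the written data sector and the check sector were accessed.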

Likewise, cylinders 710 of drives 700-707 share P and 
Q check data contained in cylinders 710 of drives 704 
and 705; cylinders 711 of drives 700-707 share P and Q 
check data contained in cylinders 711 of drives 702 and 
703; and cylinders 712 of drives 700-707 share P and Q 
check data contained in cylinders 712 of drives 700 and 
701. 
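
The staggered placement of check data across cylinders 709-712 can be written as a small rule. This is a sketch of the FIG. 22 layout as described above only; the function names are assumptions, and the pair returned is not labeled as P or Q because the text states that order explicitly only for cylinders 709.

```python
FIRST_CYLINDER = 709
DRIVES = range(700, 708)  # spindles 700-707 of group 716


def check_drives(cylinder: int) -> tuple:
    """Return the two drives holding check data for the given cylinder."""
    k = cylinder - FIRST_CYLINDER      # 0 for 709, 1 for 710, and so on
    first = 706 - 2 * k                # placement shifts two drives per cylinder
    return (first, first + 1)


def data_drives(cylinder: int) -> list:
    """Return the drives that hold application data for the given cylinder."""
    return [d for d in DRIVES if d not in check_drives(cylinder)]
```

For example, `check_drives(711)` gives drives 702 and 703, leaving six data drives for that cylinder.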

Three data groups D1-D3 are shown in FIG. 22. 
Data group D1 includes cylinders 709 of each of spindles
700, 701. Data group D2 includes cylinders 709 of
each of spindles 702, 703. Data group D3 includes all
remaining cylinders of spindles 700-707, with the exception
of those cylinders containing P and Q check
data. Data group D1 has a two-spindle bandwidth, data
group D2 has a four-spindle bandwidth and data group
D3 has a six-spindle bandwidth. Thus it is shown in
FIG. 22 that, in accordance with the principles of the 
present invention, a redundancy group can comprise 
several data groups of different bandwidths. In addition, 
each of data groups D1-D3 may alone, or in combina- 
tion with any other data group, or groups, comprise a 
separate logical storage device. This can be accom- 
plished by defining each data group or combination as 
an individual application unit. Application units are 
discussed in greater detail below. 
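
As a worked example, the capacity of each data group in FIG. 22 follows from its bandwidth and the drive geometry. The sketch below assumes that the stated bandwidth of each group is its width in spindles, and that data group D3 occupies the three cylinders 710-712; the names are illustrative only.

```python
TRACKS_PER_CYLINDER = 2
SECTORS_PER_TRACK = 5
SECTORS_PER_CYLINDER = TRACKS_PER_CYLINDER * SECTORS_PER_TRACK  # per drive


def data_group_sectors(width_in_spindles: int, cylinders: int) -> int:
    """Number of data sectors in a rectangular data group."""
    return width_in_spindles * cylinders * SECTORS_PER_CYLINDER


# (width in spindles, number of cylinders) taken from the stated bandwidths.
sizes = {
    "D1": data_group_sectors(2, 1),
    "D2": data_group_sectors(4, 1),
    "D3": data_group_sectors(6, 3),
}
```

Under these assumptions the groups hold 20, 40 and 180 sectors respectively.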

In FIG. 22, sectors 708 are numbered within each 
data group as a sequence of logical data blocks. This 
sequence is defined when the data groups are config- 
ured, and can be arranged in a variety of ways. FIG. 22 
presents a relatively simple arrangement in which the 
sectors within each of data groups D1-D3 are num- 
bered from left to right in stripes crossing the width of 
the respective data group, each data stripe having a 
depth of one sector. This arrangement permits for the 
given bandwidth of each data group a maximum paral- 
lel transfer rate of consecutively numbered sectors. 

The term "data group stripe depth" is used herein to 
describe, for a given data group, the number of logically 
contiguous data sectors stored on a drive within the 
boundaries of a single stripe of data in that data group. 
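
The relationship between a logical data block number and its position in a data group can be sketched as follows. This is an assumed mapping for a rectangular data group numbered in stripes from left to right; the function name and the zero-based (drive, sector) result are hypothetical conventions, not taken from the specification.

```python
def locate(block: int, width: int, stripe_depth: int) -> tuple:
    """Map a logical block number to (drive index, sector offset).

    width        -- number of drive positions in the data group
    stripe_depth -- logically contiguous sectors per drive within one stripe
    """
    stripe_size = width * stripe_depth            # blocks in one full stripe
    stripe, within = divmod(block, stripe_size)   # which stripe, position in it
    drive, row = divmod(within, stripe_depth)     # which drive, row in stripe
    return drive, stripe * stripe_depth + row
```

With a stripe depth of one sector, consecutive blocks fall on consecutive drives, so `locate(1, 2, 1)` is drive 1 and `locate(2, 2, 1)` wraps back to drive 0 at the next sector. With a larger stripe depth, a run of consecutive blocks stays on a single drive.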
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l^T^i W r'V he PrindpleS ° f 016 t re f ent inVen - distribution of redundancy groups and data 

lrT»',l^ P , T"? stT W m °y be , ,esser « han - groups over the active set 528 of a parallel set 501 or 502 

E a ° r Cq , rl- CP ^ 2 !! du " danc y S rou P «•> be parameterized. For example, the redundancy 

™ n, m ° f .! h,S ' FI °- 22 Sh °r 'i 3 ',* 1318 group can te characterized by . P redundancy group 

S Sh'EJ ^nn^,H a J a 6r T ^ depth ° u° ne 5 Wid ' h (in s P ind,es >' presenting the number of spindles 

sector, and are all included n a redundancy group hav- spanned by a particular set of check data, a redundancy 

R JuTJ^ ^°7 P , Stnpe , de 5! h ° f °" e Cy J indei \, gr0Up dCpth (in ^""it-sector, track or cyHnder) 
Redundancy group 716 can handle up to six data read and a redundancy group stripe depth (also in any 
™i U ™ s ™ u,,ane ° usI y-? ne each of spindles subunit-sector, track or cylinder). Data groups can be 
700-705-because the read/wnte heads of the spindles 10 characterized by width (in spindles), depth (in any 

«o U n°7 V ,V«? en r emly H ° f ° B n ,% an ?i he , r - Redu " danc y subunit-sector, track or cylinder). U data gr^ 

group 716 as configured in FIG. 22 also can handle stripe depth (also in any subunit sector, track or cylin- 

certain combinations of wnte requests s.multaneously. der). Because data groups do not start only at the berin- 

For example, in many instances any data sector of data ning of active set 528, they are also characterized by a 

group Dl can be written simultaneously with any data 15 "base", which is a two-parameter indication of the spin- 

sectors of data group D3 contained on spindles 702-705 die and the offset from the beginning ; of Zt Snd e at 

700 a 7o7™ a o C r k 7n7 UP ^ P ° r Q Ch6Ck d3,a °" Spind,eS Which ,he data * rou P ^rts. A redundancy group may] 
» a ? '] 06or707 - , ^ like a data group, include less than all of an entire spin 

fl ..vl n ?K y 8 r P - 716 ,f $ confi 8 ure # in WO- 22 »su- die. In addition, as previously stated bereirTa redun- 
ally cannot handle simultaneous wnte operations to 20 dancy group may be divided into a plurality of extents 
sectors m data groups Dl and D2, however, because to The extents of a redundancy gro^phav equal S 
perform a wnte operation in either of these data groups, and different bases and depths. For each extern the 
it is necessary to wnte to drives 706 and 707 as well. distribution of check data therein can be independently 
Only one wnte operation can be performed on the parameterized. In the preferred embodiment each "1 
check data of drives 706. 707 at any one time, because 25 dundancy group extent has additional internal ptrame- 
he read/wnte heads can only be in one place at one ters, such as the depth of each redundancy group stri™ 
time. Likewise, regardless of the d.stnbution of data within the redundancy group extent and the drive pos^ 
groups, write operations to any two data sectors backed tion of the P and Q check data for each such redun- 
up by check data on the same dnve cannot be done dancy group stripe 

simultaneously. The need for the read/write heads of 30 Redundancy group width reflects the trade-off be- 
he check dnves to be m more than one place at one tween reliability and capacity. If the redundancy group 
time can , be referred to as "collision/' width is high, then greater capacity is availabTe%Tu S e 

It is to be understood that the above-described re- only two drives out of a large number are used for 
strict™ concerning simultaneous writes to different check data, leaving the remaining drives for data. At 
data drives sharing common check drives is peculiar to 35 the other extreme, if the redundancy group width=4 
check drive systems, and is not a limitation of the inven- then a situation close to mirroring or shadowing in 
tion. For example, the restriction can be avoided by which 50% or the drives are used for check data exists 
implementing the invention using a. mirrored redun- (although with mirroring if the correct two drives out 
dancy group which does not have the property that of four fail, all data on them could be lost, while with 
different data dnves share redundancy data on the same 40 check data any two drives could be regenerated in that 
rir •» k situation). Thus low redundancy group widths repre- 

FIG 23 shows a more particularly preferred embodi- sent greater reliability, but lower capacity per unit cost 
ment of redundancy group 716 configured according to while high redundancy group widths represent ereater 
the present invention In FIG 23, as in FIG. 22, the capacity per unit cost" with bwer bu?s^^ 
logical check "drives" are spread among all of spindles 45 high, reliability 

700-707 on a per-cylinder basis, although they could Data group width reflects the trade-off discussed 
also be on a per-track basis or even a per-sector basis. above between bandwidth and request rate, with high 
Data groups Dl and D2 are configured as in FIG. 22. data group width reflecting high bandwidth and low 
The sectors of data group D3 of FIG. 22. however, data group width reflecting high request rates, 
have been divided among four data groups D4-D7. As 50 Data group stripe depth also reflects a trade-off be- 
can be seen in FIG. 23. the sequencing of sectors in data tween bandwidth and request rate. This trade-off varies 
groups D4-D7 is no longer the same as the single-sec- depending on the relationship of the average size of I/O 
tor-deep stnpmg of data groups Dl and D2. Data group requests to the data group and the depth of data stripes 
D4 has a data group stnpe depth of 20 sectors-equal to in the data group. The relationship of average I/O re 
the depth of the data group "self. Thus, in data group 55 quest size to the data group stripe depth governs how 
D41og,callynumberedsectors0-19canbereadconsec- often an I/O request to the datalrouj wHuJaTmore 
ut.vely by accessing only a single spindle 700, thereby than one read/write head within the data group it this 
allowing the read/wnte heads of spindles 701-707 to also governs bandwidth and request rate. If high band* 
handle other transactions Date groups D5, D6 and D7 width is favored, the date grou™ stripe depth is prefcra- 
each show examples of different intermediate data 60 bly chosen such that the ratio of average I/O request 

f-ZS P $ ° f 5 SeCt ° rS> 2 SCC,0rS 8nd * SeC, ° rS ' " 2e t0 Stripe de P th * ,ar « e A lar « e ratio 8 resX Z/i 

respectively. requests being more likely to span a plurality of data 

The distnbut.on of the check data over the various drives, such that the requested data can be accLed atl 

spmd les can be chosen in such a way as to minimize higher bandwidth than if the data were located all on 
collisions. Further, given a particular distribution, then 65 one drive. If, on the other hand, a h Jh requ«f ra te k 

toheextentthatsecondlevelcontrollerMhasachoice favored, the data group stripe depth t ' prefSy 

in the order of operations, the order can be chosen to chosen such that the ratio of I/O reWes size to da a 

rn.mm.ze collisions. group stri d „ „ smal ,_ 1 ^ s ; z e ° 
lesser likelihood that an I/O request will span more than
one data drive, thus increasing the likelihood that multiple
I/O requests to the data group can be handled simultaneously.

The variance of the average size of I/O requests
might also be taken into account in choosing data group
stripe depth. For example, for a given average I/O
request size, the data group stripe depth needed to
achieve a desired request rate might increase with an
increase in I/O request size variance.

In accordance with the present invention, the flexibility
of a mass storage apparatus comprising a plurality of
physical mass storage devices can be further enhanced
by grouping data groups from one or from different
redundancy groups into a common logical unit, referred
to herein as an application unit. Such application units
can thus appear to the application software of an operating
system as a single logical mass storage unit combining
the different operating characteristics of various
data groups. Moreover, the use of such application units
permits data groups and redundant groups to be configured
as desired by a system operator independent of any
particular storage architecture expected by application
software. This additional level of logical grouping, like
the redundancy group and data group logical levels, is
controlled by second level controller 14.

FIG. 24 illustrates an example of how application
units, data groups and redundancy groups might be
mapped to a device set such as parallel set 501 or 502, at
initialization of the parallel set.

Referring first to the linear graph 800 of logical unit
address space, this graph represents the mass data storage
memory of the parallel set as it would appear to the
application software of a CPU operating system. In the
particular example of FIG. 24, the parallel set has been
configured to provide a logical unit address space comprising
two application (logical) units (LUN0 and
LUN1). Logical unit LUN0 is configured to include 20
addressable logical blocks having logical block numbers
LBN0-LBN19. As shown by FIG. 24, logical unit
LUN0 also includes an unmapped logical address space
802 that is reserved for dynamic configuration. Dynamic
configuration means that during run-time of the
parallel set the CPU application software can request to
change the configuration of the parallel set from its
initial configuration. In the example of FIG. 24, unmapped
spaces 802 and 804 are reserved respectively in
each of logical units LUN0 and LUN1 to allow a data
group to be added to each logical unit without requiring
that either logical unit be taken off line. Such dynamic
configuration capability can be implemented by providing
a messaging service for a CPU application to request
the change in configuration. On behalf of mass
storage system 500, the messaging service can be handled,
for example, by the first level controllers 12. Logical
unit LUN1 includes a plurality of addressable logical
blocks LBN0-LBN179 and LBN200-LBN239. The
logical blocks LBN180-LBN199 are reserved for dynamic
configuration, and in the initial configuration of
the parallel set, as shown in FIG. 24, are not available to
the application software.

The mass storage address space of logical unit LUN0
comprises a single data group D1, as shown by data
group address space chart 806. Data group D1 includes
20 logically contiguous data blocks 0-19, configured as
shown in FIG. 22 and corresponding one to one with
logical block numbers LBN0-LBN19. Logical unit
LUN1 includes two data groups D2 and D3, comprising
respectively 40 data blocks numbered 0-39 corresponding
to logical blocks LBN200-239 of logical unit
LUN1, and 180 data blocks numbered 0-179 corresponding
to logical blocks LBN0-LBN179 of logical
unit LUN1. As shown by the example of FIG. 24, the
logical blocks of a logical unit can be mapped as desired
to the data blocks of one or more data groups in a variety
of ways. Data group address space 806 also includes
additional data groups (D4) and (D5) reserved for dynamic
configuration. These data groups can be formatted
on the disk drives of the parallel set at initialization
or at any time during the run-time of the parallel set, but
are not available to the application software in the initial
configuration of the parallel set.

The redundancy group configuration of the parallel
set is illustrated by a two dimensional address space 808,
comprising the entire memory space of the parallel set.
The horizontal axis of address space 808 represents the
thirteen physical drives of the parallel set, including the
twelve drives of active set 528 and the one spare drive
of backup set 530. In FIG. 24, the drives of the active
set are numbered 0-11 respectively to reflect their logical
positions in the parallel set. The vertical axis of
address space 808 represents the sectors of each physical
drive. As shown by redundancy group address space
808, the parallel set has been configured as one redundancy
group RG0 having three extents A, B and C. As
can be seen, the width of each extent is equal to that of
the redundancy group RG0: 12 logical drive positions
or, from another perspective, the entire width of active
set 528.

Extent A of redundancy group RG0 includes sectors
1-5 of drives 0-11. Thus, extent A of redundancy group
RG0 has a width of 12 spindles, and an extent depth of
5 sectors. In the example of FIG. 24, extent A is provided
as memory space for diagnostic programs associated
with mass storage system 500. Such diagnostic
programs may configure the memory space of extent A
in numerous ways, depending on the particular diagnostic
operation being performed. A diagnostic program
may, for example, cause a portion of another extent to
be reconstructed within the boundaries of extent A,
including application data and check data.

Extent B of redundancy group RG0 includes all application
data stored on the parallel set. More particularly,
in the example of FIG. 24, extent B includes data
groups D1, D2 and D3 configured as shown in FIG. 22,
as well as additional memory space reserved for data
groups (D4) and (D5), and a region 809 of memory
space not mapped to either logical unit LUN0 or
LUN1. This region 809 may, for example, be mapped to
another logical unit (e.g., LUN2) being used by another
application.

Address space 808 also includes a third extent C in
which a second diagnostic field may be located. Although
the parallel set is shown as including only a
single redundancy group RG0, the parallel set may
alternatively be divided into more than one redundancy
group. For example, redundancy group RG0 might be
limited to a width of 8 spindles including logical drive
positions 0-7, such as is shown in FIGS. 22 and 23, and
a second redundancy group might be provided for logical
drive positions 8-11.

It is also not necessary that the entire depth of the
parallel set be included in redundancy group RG0. As
an example, FIG. 24 shows that above and below redundancy
group RG0 are portions 810 and 811 of memory
space 808 that are not included in the redundancy
group. In the example of FIG. 24, portions 810 and 811
contain data structures reflecting the configuration of
the parallel set. These data structures are described in
greater detail below in connection with FIG. 25. In
addition, any portion of memory space between set
extents A, B and C, such as the portions indicated by
regions D and E in FIG. 24, may be excluded from
redundancy group RG0.

FIG. 24 further provides a graph 812 showing a lin- 
ear representation of the physical address space of the 
drive in logical position 0. Graph 812 represents a sec- 
tional view of address space chart 810 along line O'-O", 
and further illustrates the relationship of the various 
logical levels of the present invention as embodied in 
the exemplary parallel set configuration of FIG. 24. 
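
The mapping of logical blocks to data groups in the FIG. 24 example can be sketched as a small lookup table. The table layout and the function name are assumptions made for illustration; only the block ranges and data group names come from the description of FIG. 24.

```python
# Each entry: (first LBN, last LBN, data group, first data block in the group).
LUN_MAP = {
    "LUN0": [(0, 19, "D1", 0)],
    "LUN1": [(0, 179, "D3", 0), (200, 239, "D2", 0)],
}


def resolve(lun: str, lbn: int):
    """Return (data group, data block number), or None for an unmapped LBN."""
    for first, last, group, base in LUN_MAP[lun]:
        if first <= lbn <= last:
            return group, base + (lbn - first)
    return None  # e.g. space reserved for dynamic configuration
```

For instance, `resolve("LUN1", 200)` refers to data block 0 of data group D2, while `resolve("LUN1", 185)` returns `None` because LBN180-LBN199 are reserved in the initial configuration.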

As stated previously, the parallel set can be config- 
ured by the operator initially at installation time and/or 
during run-time of the parallel set. The operator formats 
and configures the application units he desires to use by 
first determining the capacity, performance and redundancy
requirements for each unit. These considerations
have been previously discussed herein. Once the capacity,
performance and redundancy requirements have
been defined, the logical structure of the units can be
specified by defining parameters for each of the logical
layers (redundancy group layer, data group layer and
application unit layer). These parameters are provided
to a configuration utility program executed by processor
516a of second level controller 14. The configuration
utility manages a memory resident database of
configuration information for the parallel set. Preferably,
a copy of this database information is kept in non-volatile
memory to prevent the information from being
lost in the event of a power failure affecting the parallel
set. A format utility program executed by processor
516a utilizes the information in this database as input
parameters when formatting the physical drives of the
parallel set as directed by the operator.

The basic parameters defined by the configuration
database preferably include the following:

1) For each redundancy group:

Type:          Mirrored;
               Two check drives;
               One check drive;
               No check drive.
Width:         The number of logical drive positions as
               spindles in the redundancy group.
Extent Size:   For each extent of the redundancy group,
               the size (depth) of the extent in sectors.
Extent Base:   For each extent of the redundancy group,
               the physical layer address of the first
               sector in the extent.
Stripe Depth:  For interleaved check drive groups, the depth,
               in sectors, of a stripe of check data.
Drives:        An identification of the physical drives
               included in the redundancy group.
Name:          Each redundancy group has a name that is
               unique across the mass storage system 500.

2) For each data group:

Base:          The index (logical drive number) of the drive
               position within the redundancy group that is the
               first drive position in the data group within
               the redundancy group.
Width:         The number of drive positions (logical drives)
               in the data group. This is the number of
               sectors across in the data group address space.
Start:         The offset, in sectors, within the redundancy
               group extent where the data group rectangle
               begins on the logical drive position identified
               by the base parameter.
Depth:         The number of sectors in a vertical column of
               the data group, within the redundancy group
               extent. Depth and width together are the
               dimensions respectively of the side and the top
               of the rectangle formed by each data group as
               shown in FIGS. 22-24.
Redundancy     The name of the redundancy group to which the
Group:         data group belongs.
Extent         A name or number identifying the extent in
Number:        which the data group is located.
Index:         The configuration utility will assign a number
               to each data group, unique within its redundancy
               group. This number will be used to identify the
               data group later, for the format utility and at
               run-time.
Data Group     The depth, in sectors, of logically contiguous
Stripe Depth:  blocks of data within each stripe of data in the
               data group.

3) For each application unit:

Size:          Size in sectors.
Data Group     A list of the data groups, and their size and
List:          order, within the unit address space, and the
               unit logical address of each data group.
               Each group is identified by the name of the
               redundancy group it is in and its index.

FIG. 25 illustrates exemplary data structures containing
the above-described parameters that can be used in
implementing the configuration database of a device set
such as parallel set 501 or 502. These data structures
may be varied as desired to suit the particular device set
embodiment to which they are applied. For example,
the data structures described hereafter allow for many
options that may be unused in a particular device set, in
which case the data structures may be simplified.

The configuration database includes an individual
unit control block (UCB) for each application unit that
references the parallel set (a unit may map into more
than one parallel set). These UCB's are joined together
in a linked list 900. Each UCB includes a field labeled
APPLICATION UNIT # identifying the number of
the application unit described by that UCB. Alternatively,
the UCB's within link list 900 might be identified
by a table of address pointers contained in link list 900
or in some other data structure in the program memory
of microprocessor 516a. Each UCB further includes a
map 901 of the data groups that are included in that
particular application unit. Data group map 901 includes
a count field 902 defining the number of data
groups within the application unit, a size field 904 defining
the size of the application unit in sectors, and a type
field 906 that defines whether the linear address space of
the application unit is continuous (relative addressing)
or non-continuous (absolute addressing). A non-continuous
address space is used to allow portions of the
application unit to be reserved for dynamic configuration
as previously described in connection with data
groups (D4) and (D5) of FIG. 22.

Data group map 901 further includes a data group
mapping element 908 for each data group within the
application unit. Each data group mapping element 908
includes a size field 910 defining the size in sectors of the
corresponding data group, a pointer 912 to a descriptor
block 914 within a data group list 916, a pointer 718 to
an array control block 720, and an index field 721. The
data group mapping elements 908 are listed in the order
in which the data blocks of each data group map to the
LBN's of the application unit. For example, referring to
LUN1 of FIG. 24, the mapping element for data group
D3 would be listed before the data group mapping
element for data group D2. Where the address space of
[illegible] non-continuous, as in the case of LUN1 of
FIG. 24, data group map 901 may include mapping elements
corresponding to, and identifying the size of, the gaps
between available ranges of LBN's.

Data group list 916 includes a descriptor block 914 for
each data group within the parallel set, and provides
parameters for mapping each data group to the redundancy
group and redundancy group extent in which it is located.
Data group list 916 includes a count field 717 identifying
the number of descriptor blocks in the list. In the case of
a redundancy group having a striped check data
configuration, each data group descriptor block 914 may
include a pqdel field 722 that defines the offset of the
first data block of the data group from the beginning of
the check data for the redundancy group stripe that
includes the first data block. The value of pqdel field 722
may be positive or negative, depending on the relative
positions of the drive on which the first data block of the
data group is configured and the corresponding check data
drives for the redundancy group stripe including that first
data block. This value can be useful for assisting the
second level controller in determining the position of the
check data during I/O operations.

Each data group descriptor block 914 also includes an
index field [numeral illegible] (same value as index field
721), a width field 724, a base field 726, an extent number
field 727, a start field 728, a depth field 730, a data
group stripe depth field 731 and a redundancy group name
field 732 that respectively define values for the
corresponding parameters previously discussed herein.

Array control block 720 provides a map of redundancy
groups of the parallel set to the physical address space of
the drives comprising the parallel set. Array control block
720 includes an array name field 734 and one or more fields
735 that uniquely identify the present configuration of the
parallel set. Array control block 720 also includes a list
of redundancy group descriptor blocks 736. Each redundancy
group descriptor block 736 includes a redundancy group name
field 738 identifying the redundancy group corresponding to
the descriptor block, a redundancy group width field 740
and a redundancy group extent map 742. Array control block
720 further includes a list 744 of physical drive
identifier blocks 745.

For each extent within the redundancy group, redundancy
group extent map 742 includes an extent descriptor block
746 containing parameters that map the extent to
corresponding physical address in the memory space of the
parallel set, and define the configuration of redundant
information in the extent. As an example, extent descriptor
blocks are shown for the three extents of redundancy group
RG0 of FIG. 24, each extent descriptor block including an
extent number field 747 and base and size fields defining
the physical addresses of the corresponding extent.
Application data base and size fields 748 and 750
correspond respectively to the base and size of extent B of
redundancy group RG0; diagnostic (low) base and size fields
752 and 754 correspond respectively to the base and size of
extent A of redundancy group RG0; and diagnostic (high)
base and size fields 756 and 758 correspond respectively to
the base and size of extent C of redundancy group RG0.

Each extent descriptor block 746 also includes a type
field 760 that defines the type of redundancy implemented
in the extent. For example, a redundancy group extent may
be implemented by mirroring or shadowing the mass storage
data stored in the data group(s) within the extent (in
which case, the extent will have an equal number of data
drives and redundant drives). Alternatively, a Reed-Solomon
coding algorithm may be used to generate check data on one
drive for each redundancy group stripe within the extent,
or a more sophisticated Reed-Solomon coding algorithm may
be used to generate two drives of check data for each
redundancy group stripe. Type field 760 may specify also
whether the check data is to be striped throughout the
extent and how it is to be staggered (e.g., the type field
might index a series of standardized check data patterns,
such as a pattern in which check data for the first
redundancy group stripe in the extent is located on the two
numerically highest logical drive positions of the
redundancy group, check data for the second redundancy
group stripe in the extent is located on the next two
numerically highest logical drive positions, and so on).
Yet another alternative is that type field 760 indicates
that no check drives are included in the initial
configuration of the redundancy group extent. This may be
desired, for example, if the redundancy group extent is
created for use by diagnostic programs. A redundancy group
extent of this type was previously discussed in connection
with extent A of redundancy group RG0 shown in FIG. 24.

Each extent descriptor block 746 may further include a
redundancy group stripe depth field 762 to specify, if
appropriate, the depth of redundancy group stripes within
the extent.

List 744 of physical drive identifier blocks 745 includes
an identifier block 745 for each physical drive in the
parallel set. Each identifier block 745 provides
information concerning the physical drive and its present
operating state, and includes in particular one or more
fields 764 for defining the logical position in the
parallel set of the corresponding physical drive.

To summarize briefly the intended functions of the various
data structures of FIG. 25, the unit control blocks of link
list 900 define the mapping of application units to data
groups within the parallel set. Mapping of data groups to
redundancy groups is defined by data group list 916, and
mapping of redundancy groups to the physical address space
of the memory of the parallel set is defined by array
control block 720.

When each physical disk of the parallel set is formatted by
the formatting utility, a copy of the array control block
720, link list 900 and data group list 916 are stored on
the drive. This information may be useful for various
operations such as reconstruction of a failed drive. A copy
of the configuration database also may be written to the
controller of another parallel set, such that if one
parallel set should fail, another would be prepared to take
its place.

During each I/O request to a parallel set, the mapping from
unit address to physical address spaces must be made.
Mapping is a matter of examining the configuration database
to translate: (1) from a unit logical address span
specified in the I/O request to a sequence of data group
address spans; (2) from the sequence of data group address
spans to a set of address spans on logical drive positions
within a redundancy group; and then (3) from the set of
address spans on logical drive positions to actual physical
drive address spans. This mapping process can be done by
having an I/O request server step through the data
structures of the configuration database in response to
each I/O request. Alternatively, during initialization of
the parallel set the configuration utility may, in addition
to generating the configuration
database as previously described, generate subroutines for
the I/O request server for performing a fast mapping
function unique to each data group. The particular manner
in which the I/O request server carries out the mapping
operations is implementation specific, and it is believed
to be within the skill of one in the art to implement an
I/O request server in accordance with the present invention
as the invention is described herein.

The following is an example of how the I/O request server
might use the data structures of FIG. 25 to map from a
logical unit address span of an application I/O
[illegible] the physical address space of a parallel set.
The logical unit address span is assumed to be defined in
the I/O request by an application unit number and one or
more LBN's within that application unit.

The I/O request server determines from the I/O request the
application unit being addressed and whether that
application unit references the parallel set. This latter
determination can be made by examining link list 900 for a
UCB having an APPLICATION UNIT # corresponding to that of
the I/O request. If an appropriate UCB is located, the I/O
request server next determines from the LBN(s) specified in
the I/O request the data group or data groups in which data
block(s) corresponding to those LBN(s) are located. This
can be accomplished by comparing the LBN(s) to the size
fields 910 of the mapping elements in data group map 901,
taking into account the offset of that size field from the
beginning of the application unit address space (including
any gaps in the application unit address space). For
example, if the size value of the first data group mapping
element in map 901 is greater than the LBN(s) of the I/O
request, then it is known that the LBN(s) correspond to
data blocks in that data group. If not, then the size value
of that first mapping element is added to the size value of
the next mapping element in map 901 and the LBN(s) are
checked against the resulting sum. This process is repeated
until a data group is identified for each LBN in the I/O
request.

Having identified the appropriate data group(s), the I/O
request server translates the span of LBN's in the I/O
request into one or more spans of corresponding data block
numbers within the identified data group(s). The
configuration utility can then use the value of index field
921 and pointer 912 within each mapping element 908
corresponding to an identified data group to locate the
data group descriptor block 914 in data group list 916 for
that data group. The I/O request server uses the parameters
of the data group descriptor block to translate each span
of data block numbers into a span of logical drive
addresses.

First, the I/O request server determines the logical drive
position of the beginning of the data group from the base
field 726 of the data group descriptor block 914. The I/O
request server also determines from fields 732 and 727 the
redundancy group name and extent number in which the data
group is located, and further determines from start field
728 the number of sectors on the drive identified in base
field 726 between the beginning of that redundancy group
extent and the beginning of the data group. Thus, for
example, if the I/O request server is reading the
descriptor block for data group D3 configured as shown in
FIG. 24, base field 726 will indicate that the data group
begins on logical drive position 0, redundancy name field
732 will indicate that the data group is in redundancy
group RG0, extent field 727 will indicate that the data
group is in extent B, and start field 728 will indicate
that there is an offset of 10 sectors on logical drive 0
between the beginning of extent B and the first data block
of data group D3.

Knowing the logical drive position and extent offset of the
first data block of the data group, the I/O request server
then determines the logical drive position and extent
offset for each sequence of data blocks in the data group
corresponding to the LBN's of the I/O request. To do this,
the I/O request server may use the values of width field
724, depth field 730 and data group stripe depth field 731.
If any check data is included within the rectangular
boundaries of the data group, [illegible] extent offset
address spans of the data blocks. This can be accomplished
[illegible] array control block 720. More particularly, the
I/O request server can determine the logical drive position
and extent offset of any check [illegible] group by
examining the type field 760 and the redundancy group
stripe depth field 762 of the appropriate redundancy group
extent descriptor block 746 (the I/O request server can
determine which extent descriptor block 746 is appropriate
by finding the extent descriptor block 746 having an extent
number field 747 that matches the corresponding extent
number field 727 in the data group's descriptor block 914).
The I/O request server is directed to array control block
720 by the pointer 718 in the data group mapping element
908.

To translate each logical drive position and extent offset
address span to a physical address span on a particular
physical drive of the parallel set, the I/O request server
reads the physical drive identifier blocks 745 to determine
the physical drive corresponding to the identified logical
drive position. The I/O request server also reads the base
field of the appropriate extent descriptor block 746 of
array control block 720 (e.g., application base field 752),
which provides the physical address on the drive of the
beginning of the extent. Using the extent offset address
span previously determined, the I/O request server can then
determine for each physical drive the span of physical
addresses that corresponds to the identified extent offset
address span.

It may occur that during operation of a parallel set one or
more of the physical drives is removed or fails, such that
the data on the missing or failed drive must be
reconstructed on a spare drive. In this circumstance, the
configuration of the set must be changed to account for the
new drive, as well as to account for temporary set changes
that must be implemented for the reconstruction period
during which data is regenerated from the missing or failed
drive and reconstructed on the spare. It is noted that the
configuration utility can be used to remap the set
configuration by redefining the parameters of the
configuration database.

In general, to those skilled in the art to which this
invention relates, many changes in construction and widely
differing embodiments and applications of the present
invention will suggest themselves without departing from
its spirit and scope. For instance, a greater number of
second level controllers and first level controllers may be
implemented in the system. Further, the structure of the
switching circuitry connecting the second level controllers
to the disk drives may be altered so that different drives
are the primary responsibility of different second level
controllers. Thus, the disclosures and descriptions herein
are purely illustrative and not intended to be in any sense
limiting. The scope of the invention is set forth in the
appended claims.
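
By way of illustration only, the data group lookup described in the mapping example (accumulating the size fields of the mapping elements until the running sum exceeds the LBN) and the final translation from extent offset to physical address can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the I/O request server: the names MappingElement, locate_data_group and physical_address are hypothetical, LBNs are assumed to start at 0, data blocks within a data group are assumed to be numbered sequentially, and a gap in the application unit address space is assumed to be marked by a negative group index.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MappingElement:
    """Hypothetical stand-in for one mapping element of the data group map."""
    size: int         # size field: number of LBNs covered by this element
    group_index: int  # index of the data group; a negative value marks a gap (assumption)


def locate_data_group(elements: List[MappingElement], lbn: int) -> Tuple[int, int]:
    """Return (group_index, block number within the group) for one LBN.

    The size fields are accumulated element by element; the first element
    whose running sum exceeds the LBN is the one that contains it.
    """
    running_sum = 0
    for element in elements:
        if lbn < running_sum + element.size:
            if element.group_index < 0:
                raise ValueError(f"LBN {lbn} falls in a gap of the unit address space")
            # Assumption: blocks in a data group are numbered sequentially from 0.
            return element.group_index, lbn - running_sum
        running_sum += element.size
    raise ValueError(f"LBN {lbn} is beyond the end of the application unit")


def physical_address(extent_base: int, extent_offset: int) -> int:
    """Physical sector address: start of the extent on the drive plus the offset into it."""
    return extent_base + extent_offset


if __name__ == "__main__":
    unit_map = [MappingElement(100, 0), MappingElement(50, -1), MappingElement(200, 1)]
    print(locate_data_group(unit_map, 99))
    print(locate_data_group(unit_map, 150))
    print(physical_address(1000, 10))
```

The translation from a block number within a data group to a logical drive position depends on the width, depth and stripe depth parameters and on any check data inside the data group rectangle, so it is deliberately left out of this sketch.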
What is claimed is: 

1. A system for storing data received from an external 
source, comprising: 

at least two control means for providing control of 
data flow to and from the external source; 

a plurality of storage means coupled to said at least 
two control means wherein said storage means are 
divided into groups and each group is controlled 
by said at least two of said control means such that 
in the case that a first control means coupled to a 
particular group of storage means fails, control of 
said particular group is assumed by a second con- 
trol means; 

a plurality of data handling means coupled to said at 
least two control means for disassembling data into 
data blocks to be written across a group of said 
storage means; and 

error detection means coupled to said control means 
and said storage means for calculating at least one 
error detection term for each group of storage 
means based on the data received from the external 
source using a selected error code and providing 
said error detection term to be compared with data
to detect errors, said error detection means being 
coupled to each of said control means to receive 
the data from said control means and transmit said 
error detection term to an error code storage 
means in said group of storage means. 

2. The system of claim 1 wherein said data handling 
means further include assembly means for assembling 
said data blocks received from said control means. 

3. The system of claim 1 further comprising a first bus
for coupling to said external source and a plurality of
buffer means coupled to said first bus and to said control
means for buffering data received by and transmitted
from the system.

4. The system of claim 3 further comprising error
correction means coupled to said error detection means
for correcting error in data as said data is transmitted
from either said buffer means to said storage means
through said data handling means or from said storage
means to said buffer means through said data handling
means.

5. The system of claim 3 wherein the error detection
means uses a Reed-Solomon error code to detect errors
in said data received from said buffer means and said
storage means.

6. The system of claim 3 wherein said data handling
means includes detachment means coupled to said error
detection means for detaching from the system storage
means and buffer means which transmit erroneous data
responsive to receiving said error detection term from
said error detection means.

7. The system of claim 1 wherein said plurality of
storage means comprises:

a first group of data storage means for storing data
from the external source; and

a second group of error check and correction (ECC)
storage means for storing ECC data generated by
said error detection means.

8. The system of claim 1 wherein each of said plural-
ity of storage means stores data and error check and
correction (ECC) data in a predefined pattern.

9. In a system including at least two control means for
communicating with an external source and a plurality
of storage means wherein at least two of the control
means are connected to each of the storage means, a
method for storing data received from the external
source comprising the steps of:

receiving data from the external source;

configuring the plurality of storage means into
groups wherein each group is initially controlled
by at least two of the control means such that in the
case that one of the control means fails, the storage
means of each group is accessible through another
one of the control means;

disassembling data into groups of data blocks to be
written to said plurality of storage means;

calculating at least one error detection term from said
data using a selected error code;

storing said data blocks in a first of said groups of
storage means; and

storing said at least one error detection term in said
first of said groups of storage means.

10. The method of claim 9 further comprising the
steps of:

retrieving said data blocks from said first of said
groups of storage means;

calculating a check error detection term from said
data blocks using a selected error code;

retrieving said at least one error detection term from
said first of said groups of storage means; and

comparing said check error detection term to said at
least one error detection term to determine that
said data has not been corrupted.

11. The method of claim 10 further comprising the
step of correcting said data if it is determined that said
data has been corrupted.

12. The method of claim 10 further comprising the
step of assembling said data blocks into a form in which
it was received from the external source.

13. The method of claim 9 wherein the step of config-
uring further comprises the step of setting a plurality of
switching means to allow said data blocks to be passed
between the control means and the storage means in a
predefined pattern.

14. The method of claim 9 further comprising the step
of detaching a particular storage means upon which said
data was stored if it is determined that said data has been
corrupted.

15. A system for storing data received from an exter-
nal source, comprising:

control means for providing control of data flow to
and from the external source;

a plurality of storage means coupled to said control
means wherein said storage means are divided into
groups;

a plurality of data handling means coupled to said
control means for disassembling data with data
blocks to be written to said storage means; and

error detection means coupled to said control means
for receiving said data blocks in parallel form and
detecting errors in each data block substantially
simultaneously as said data blocks are written to
said storage means.

16. The system of claim 15 further comprising data 
correction means coupled to said error detection means 
for correcting corrupted data in response to an error 
detection signal provided by said error detection means. 

17. The system of claim 16 wherein said plurality of 
storage means comprises: 

a first group of data storage means for storing data 
from the external source; and 
a second group of error check and correction (ECC)
storage means for storing ECC data generated by 
said error correction means. 

18. The system of claim 16 wherein the error detec- 
tion means uses a Reed-Solomon error code to detect 
errors in said data received from said buffer means and 
said storage means. 

19. The system of claim 16 wherein the error correc- 
tion means uses a Reed-Solomon error code to correct 
errors in said data received from said buffer means and 
said storage means. 

20. The system of claim 15 further comprising de- 
tachment means coupled to said error detection means 
for detaching a particular storage means from the sys- 
tem which has provided corrupted data as determined 
by said error detection means. 

21. The system of claim 15 wherein said data handling 
means further includes assembly means for assembling 
said data blocks received from said control means. 

22. The system of claim 15 further comprising a first 
bus for coupling to said external source and a plurality 
of buffer means coupled to said first bus and to said 
control means for buffering data received by and trans- 
mitted from the system. 

23. The system of claim 15 wherein each of said plu- 
rality of storage means stores data and error check and 
correction (ECC) data in a predefined pattern. 

24. In a system including control means for communi- 
cating with an external source and a plurality of storage 
means, a method for storing data received from the 
external source comprising the steps of: 

receiving data from the external source;

disassembling the data into groups of data blocks to
be written to said plurality of storage means;

calculating at least one error detection term for each
data block substantially simultaneously; and

storing said data blocks and said at least one error
detection term in a first of said groups of storage
means substantially simultaneously.

25. The method of claim 24 further comprising the
steps of:

retrieving said data blocks from said first of said
groups of storage means;

calculating a check error detection term from said
data blocks using a selected error code;

retrieving said at least one error detection term from
said first of said groups of storage means; and

comparing said check error detection term to said at
least one error detection term to determine that
said data has not been corrupted.

26. The method of claim 25 further comprising the
step of correcting said data if it is determined that said
data has been corrupted.

27. The method of claim 25 further comprising the
step of assembling said data blocks into a form in which
it was received from the external source.

28. The method of claim 24 wherein the step of con-
figuring further comprises the step of setting a plurality
of switching means to allow said data blocks to be
passed between the control means and the storage
means in a predefined pattern.

29. The method of claim 24 further comprising the
step of detaching a particular storage means upon
which said data was stored if it is determined that said
data has been corrupted.
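
For illustration only, the store-and-verify sequence recited in the method claims (calculate an error detection term, store it with the data blocks, later recalculate a check term and compare) can be sketched as below. This is a minimal sketch that assumes a simple byte-wise XOR parity as the selected error code; the claims elsewhere mention a Reed-Solomon code, which this sketch does not implement. The function names are hypothetical.

```python
from functools import reduce
from typing import List


def error_detection_term(blocks: List[bytes]) -> bytes:
    """XOR equal-length data blocks byte by byte to form one parity block.

    Assumption: XOR parity stands in for the selected error code.
    """
    if not blocks:
        raise ValueError("at least one data block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all data blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def is_uncorrupted(retrieved_blocks: List[bytes], stored_term: bytes) -> bool:
    """Recalculate the check term from retrieved blocks and compare it to the stored term."""
    return error_detection_term(retrieved_blocks) == stored_term


if __name__ == "__main__":
    data_blocks = [b"\x01\x02", b"\x04\x08"]
    term = error_detection_term(data_blocks)      # stored alongside the data blocks
    print(is_uncorrupted(data_blocks, term))      # blocks read back unchanged
    print(is_uncorrupted([b"\x01\x03", b"\x04\x08"], term))  # one byte altered
```

A single parity term of this kind only detects that something changed; locating and correcting the damaged block would need a stronger code or additional information.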

***** 
ABSTRACT



A data storage system includes a plurality of control/data 
buses. A memory section is coupled to the plurality of 
control/data buses. The memory section includes a memory 
and a plurality of control logic sections interconnected 
through an arbitration bus. Each one of the control logic 
sections is coupled between a corresponding one of the 
control/data buses and the memory. Each one of such control 
logic sections includes a control logic for controlling trans- 
fer of data between the memory and the one of the plurality 
of control/data buses coupled to said one of the logic 
sections. The control logic is adapted to produce a control/ 
data bus request for the one of the control/data buses coupled 
thereto and is adapted to effect the transfer in response to a 
control/data bus grant fed to the control logic. Each one of
the control logic sections also includes a bus arbitration 
section coupled to the arbitration bus. Each one of the bus 
arbitration sections is adapted to: (1) receive a control/data 
bus request from the control logic in such one of the control 
logic sections and from the other control logic sections 
coupled to such arbitration bus; (2) grant access to the 
control/data bus to one of the control logic sections in 
accordance with the control/data bus requests coupled to the 
bus arbitration section; (3) receive control/data bus grants 
from the other control logic sections coupled to such arbi- 
tration bus; and (4) distribute the control/data bus request 
produced by the control logic in said control logic section to 
the other control logic sections coupled to the arbitration 
bus. The bus arbitration section has fault tolerance. 

12 Claims, 9 Drawing Sheets 
BUS ARBITRATION SYSTEM

BACKGROUND OF THE INVENTION 

This invention relates generally to data storage systems,
and more particularly to data storage systems having redun-
dancy arrangements to protect against total system failure in
the event of a failure in a component or subassembly of the
storage system.

As is known in the art, large mainframe computer systems
require large capacity data storage systems. These large
mainframe computer systems generally include data pro-
cessors which perform many operations on data introduced
to the computer system through peripherals including the
data storage system. The results of these operations are
output to peripherals, including the storage system.

One type of data storage system is a magnetic disk storage
system. Here a bank of disk drives and the mainframe
computer system are coupled together through an interface.
The interface includes CPU, or "front end", controllers and
"back end" disk controllers. The interface operates the
controllers in such a way that they are transparent to the
computer. That is, data is stored in, and retrieved from, the
bank of disk drives in such a way that the mainframe
computer system merely thinks it is operating with one
mainframe memory. One such system is described in U.S.
Pat. No. 5,206,939, entitled "System and Method for Disk
Mapping and Data Retrieval", inventors Moshe Yanai,
Natan Vishlitzky, Bruno Alterescu and Daniel Castel, issued
Apr. 27, 1993, and assigned to the same assignee as the
present invention.

As described in such U.S. Patent, the interface may also
include, in addition to the CPU controllers and disk
controllers, addressable cache memories. The cache memory
is a semiconductor memory and is provided to rapidly store
data from the mainframe computer system before storage in
the disk drives, and, on the other hand, store data from the
disk drives prior to being sent to the mainframe computer.
The cache memory being a semiconductor memory, as
distinguished from a magnetic memory as in the case of the
disk drives, is much faster than the disk drives in reading and
writing data.

The CPU controllers, disk controllers and cache memory
are interconnected through a backplane printed circuit
board. More particularly, disk controllers are mounted on
disk controller printed circuit boards. CPU controllers are
mounted on CPU controller printed circuit boards. And,
cache memories are mounted on cache memory printed
circuit boards. The disk controller, CPU controller and cache
memory printed circuit boards plug into the backplane
printed circuit board. In order to provide data integrity in
case of a failure in a controller, the backplane printed circuit
board has a pair of buses. One set of the disk controllers is
connected to one bus and another set of the disk controllers
is connected to the other bus. Likewise, one set of the CPU
controllers is connected to one bus and another set of the
CPU controllers is connected to the other bus. The cache
memories are connected to both buses. Each one of the buses
provides data, address and control information.

Thus, the use of two buses provides a degree of redun-
dancy to protect against a total system failure in the event
that the controllers or disk drives connected to one bus fail.

SUMMARY OF THE INVENTION 

In accordance with the present invention, a data storage 
system is provided. The data storage system includes a 
plurality of control/data buses. A memory section is coupled
to the plurality of control/data buses. The memory section 
includes a memory and a plurality of control logic sections 
interconnected through an arbitration bus. Each one of the 
control logic sections is coupled between a corresponding 
one of the control/data buses and the memory. Each one of 
such control logic sections includes a control logic for 
controlling transfer of data between the memory and the one 
of the plurality of control/data buses coupled to said one of 
the logic sections. The control logic is adapted to produce a 
control/data bus request for the one of the control/data buses 
coupled thereto and is adapted to effect the transfer in 
response to a control/data bus grant fed to the control logic. 
Each one of the control logic sections also includes a bus 
arbitration section coupled to the arbitration bus. Each one 
of the bus arbitration sections is adapted to: (1) receive a 
control/data bus request from the control logic in such one 
of the control logic sections and from the other control logic 
sections coupled to such arbitration bus; (2) grant access to 
the control/data bus to one of the control logic sections in 
accordance with the control/data bus requests coupled to the 
bus arbitration section; (3) receive control/data bus grants 
from the other control logic sections coupled to such arbi- 
tration bus; and (4) distribute the control/data bus request 
produced by the control logic in said control logic section to 
the other control logic sections coupled to the arbitration 
bus. 

In accordance with another feature of the invention, each 
one of the bus arbitration sections includes a majority gate 
fed by the control/data bus grants received from the other 
control logic sections for producing an internal control/data 
bus grant when a majority of the control logic sections 
indicate that the said one of the bus arbitration sections has 
been granted the control/data bus. 
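
As an illustration of the majority gate described above, the following minimal sketch produces an internal grant only when more than half of the grant signals received for a given section are asserted. The function name and the representation of the grant signals as a list of booleans are assumptions made for the example; the hardware gate itself is not described at this level of detail here.

```python
from typing import Sequence


def majority_gate(grant_signals: Sequence[bool]) -> bool:
    """Return True when a strict majority of the grant signals are asserted.

    Each entry is one control logic section's indication that this
    section has been granted the control/data bus (hypothetical encoding).
    """
    asserted = sum(1 for signal in grant_signals if signal)
    return asserted * 2 > len(grant_signals)


if __name__ == "__main__":
    # Three sections agree and one faulty section disagrees: grant is still produced.
    print(majority_gate([True, True, True, False]))
    # Only one of four sections indicates a grant: no internal grant.
    print(majority_gate([True, False, False, False]))
```

Voting in this way means a single faulty arbitration section cannot, on its own, grant or withhold the bus.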

In accordance with another feature of the invention, each
one of the bus arbitration sections includes an internal
arbitrator responsive to control/data bus requests from the
plurality of control logic sections and provides a control/data
bus grant to one of the plurality of control logic sections
selectively in accordance with a pre-determined criteria.

In accordance with another feature of the invention, each
one of the bus arbitration sections includes an internal
arbitrator responsive to control/data bus requests from the
plurality of control logic sections and wherein each one of
the plurality of control logic sections provides a control/data
bus grant to one of the plurality of control logic sections
selectively in accordance with a common pre-determined
criteria.

BRIEF DESCRIPTION OF THE DRAWING 

These and other features of the invention will become 
more readily apparent when read together with the accom- 
panying drawings, in which: 

FIG. 1 is a block diagram of a computer system using a 
data storage system in accordance with the invention; 

FIG. 2 is a block diagram of an exemplary one of a 
plurality of cache memories used in the system of FIG. 1; 

FIG. 3 is a block diagram of a control logic section used 
in the exemplary one of the cache memories of FIG. 2; 

FIG. 4 is a block diagram of a bus arbitration section used 
in the control logic section of FIG. 3; 

FIG. 5 is a block diagram showing a plurality of the 
control logic sections of FIG. 3 interconnected through an 
arbitration bus; 

FIGS. 5A-5D are block diagrams showing the plurality of 
the control logic sections of FIG. 3 interconnected through 
an arbitration bus as shown in FIG. 5, FIGS. 5A-5D 
showing different examples of data on wires of the 
arbitration bus, a logic 1 condition being indicated in 
FIGS. 5A-5B by heavy weight (i.e., darker) lines. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

Referring now to FIG. 1, a computer system 10 is shown. 
The computer system 10 includes a main frame computer 
section 12 having main frame processors 14 for processing 
data. Portions of the processed data are stored in, and 
retrieved from, a bank 16 of disk drives through an 
interface 18. 

The interface 18 includes disk controllers 20, central 
processor unit (CPU) controllers 22 and addressable cache 
memories 24a, 24b, 24c, and 24d electrically interconnected 
through a backplane 25, here four control/data buses; i.e., an 
A bus, a B bus, a C bus, and a D bus, as shown. The cache 
memories 24a, 24b, 24c and 24d are hereinafter sometimes 
referred to as memory section 24a, 24b, 24c and 24d, 
respectively. 

More particularly, in order to provide data integrity in 
case of a failure in a disk controller 20 or CPU controller 22, 
the four control/data buses (i.e., A bus, B bus, C bus and 
D bus) are provided. One portion of the disk controllers 20 
is connected to the A bus, a second portion to the B 
bus, a third portion to the C bus and the remaining portion 
to the D bus. Likewise, one portion of the CPU controllers 
22 is connected to the A bus, a second portion to the B bus, 
a third portion to the C bus and the remaining portion to the 
D bus. The cache memories 24a, 24b, 24c and 24d are 
connected to all four control/data buses (i.e., the A bus, the 
B bus, the C bus and the D bus) as shown. 

Each one of the controllers 20, 22 is adapted to assert on 
the control/data bus coupled thereto during a controller 
initiated control/data bus assert interval: (a) a memory 
address; and (b) a command, such command including: (i) 
either a write operation request or a read operation request; 
and, (ii) when a write operation is requested during a 
subsequent control/data bus grant interval, data and bus 
write clock pulses. A timing protocol suitable for use in the 
system 10 is described in co-pending patent application 
entitled "TIMING PROTOCOL FOR A DATA STORAGE 
SYSTEM", inventor John K. Walton, filed on the same date 
as this application, assigned to the same assignee as the 
present invention, the entire subject matter thereof being 
incorporated herein by reference. 

An exemplary one of the memory sections 24a-24d, here 
memory section 24a, is shown in detail in FIG. 2. Such 
memory section 24a includes a plurality of, here four, 
random access memory (RAM) regions (i.e., RAM region A, 
RAM region B, RAM region C and RAM region D), as 
shown, and a matrix of rows and columns of control logic 
sections, here Application Specific Integrated Circuits 
(ASICs), i.e., control logic section ASIC A,A . . . control 
logic section ASIC D,D. Each one of the columns of control 
logic sections ASIC A,A . . . ASIC D,D is coupled to a 
corresponding one of the four control/data buses A bus, B 
bus, C bus and D bus and each one of the rows of the control 
logic sections ASIC A,A . . . ASIC D,D is coupled to a 
corresponding one of the four RAM regions, RAM region 
A . . . RAM region D, as indicated. More particularly, here 
there are four rows of control logic section ASICs. The first 
row includes ASICs A,A; A,B; A,C; and A,D each being 
coupled to the RAM region A. The second row includes 
ASICs B,A; B,B; B,C; and B,D each being coupled to the 
RAM region B. The third row includes ASICs C,A; C,B; 
C,C; and C,D each being coupled to the RAM region C. The 
fourth row includes ASICs D,A; D,B; D,C; and D,D each 
being coupled to the RAM region D. Further, the control 
logic sections ASIC A,A . . . ASIC D,D in each of the four 
rows thereof are interconnected through an arbitration bus. 
Thus, the first row of ASICs A,A; A,B; A,C; and A,D are 
interconnected by arbitration bus AB1. The second row of 
ASICs B,A; B,B; B,C; and B,D are interconnected by 
arbitration bus AB2. The third row of ASICs C,A; C,B; C,C; 
and C,D are interconnected by arbitration bus AB3. The 
fourth row of ASICs D,A; D,B; D,C; and D,D are 
interconnected by arbitration bus AB4, as indicated. The 
interconnection of the control logic sections (i.e., ASICs 
A,A-A,D; B,A-B,D; C,A-C,D; and D,A-D,D) in each of the 
rows thereof through the one of the four arbitration buses 
AB1-AB4, respectively, will be described in more detail in 
connection with FIG. 5 for an exemplary row, here the first 
row with ASICs A,A; A,B; A,C and A,D. Suffice it to say 
here, however, that each one of the RAM regions (i.e., RAM 
regions A-D) is coupled to the four control/data 
buses (i.e., A bus, B bus, C bus and D bus) through a 
corresponding one of the four control logic sections ASICs 
A,A . . . D,D in the one of the rows of control logic sections 
ASICs A,A . . . D,D coupled to such RAM region, as 
described above. 

Each one of the four columns of control logic section 
ASICs is coupled to a corresponding one of the 
control/data buses. More particularly, a first column of 
control logic sections (i.e., ASICs A,A; B,A; C,A and D,A) 
are coupled to the A bus. A second column of control logic 
sections (i.e., ASICs A,B; B,B; C,B and D,B) are coupled to 
the B bus. A third column of control logic sections (i.e., 
ASICs A,C; B,C; C,C and D,C) are coupled to the C bus. A 
fourth column of control logic sections (i.e., ASICs A,D; 
B,D; C,D and D,D) are coupled to the D bus. 

Each one of such control logic sections ASICs A,A-D,D 
is identical in construction, an exemplary one thereof, here 
control logic section ASIC A,A, being shown in detail in FIG. 
3 to include a control logic 50 having control logic and a 
buffer memory as described in the above-referenced 
co-pending patent application entitled "TIMING PROTOCOL 
FOR A DATA STORAGE SYSTEM" for controlling 
transfer of data between the memory and the one of the 
plurality of control/data buses (i.e., A bus, B bus, C bus and 
D bus) coupled to the control logic section ASIC A,A. The 
control logic section ASIC A,A is adapted to produce a 
control/data bus request for the one of the control/data buses 
coupled thereto (here RAM region A) and is adapted to 
effect the transfer in response to a control/data bus grant fed 
to the control logic section (here ASIC A,A) in accordance 
with the protocol described in the above-referenced, 
co-pending application entitled "TIMING PROTOCOL 
FOR A DATA STORAGE SYSTEM". The control logic 
section ASIC A,A also includes a bus arbitration section 52 
coupled to one of the four arbitration buses AB1-AB4, here to 
arbitration bus AB1. The bus arbitration section 52 will be 
described in more detail in connection with FIG. 4. Suffice 
it to say here, however, that the bus arbitration section 52 is 
adapted to: (1) receive a control/data bus request from the 
control logic in such one of the control logic sections ASICs 
A,A-A,D in the row thereof and from the other control logic 
sections coupled to such arbitration bus AB1; (2) grant 
access to the control/data bus to one of the control logic 
sections ASICs A,A-A,D in accordance with the control/data 
bus requests coupled to the bus arbitration section 52; (3) 
receive control/data bus grants from the other control logic 
sections ASICs A,B-A,D coupled to such arbitration bus 
AB1; and (4) distribute the control/data bus request 
produced by the control logic sections ASICs A,A-D,D to the 
other control logic sections ASICs coupled to the arbitration 
bus AB1. 
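The row and column arrangement described above can be summarised in a short sketch. It is only an illustration of the wiring as the text states it (row letter selects the RAM region and the arbitration bus, column letter selects the control/data bus); the function names `describe_asic` and `arbitration_peers` are hypothetical and do not appear in the patent.

```python
# Illustrative model of the 4 x 4 matrix of control logic sections (ASICs).
# Assumption: names and data layout are invented for this sketch.
ROWS = "ABCD"     # row letter: RAM region and arbitration bus
COLUMNS = "ABCD"  # column letter: control/data bus


def describe_asic(row: str, column: str) -> dict:
    """Return the connections of control logic section ASIC row,column."""
    return {
        "name": f"ASIC {row},{column}",
        "ram_region": f"RAM region {row}",
        "control_data_bus": f"{column} bus",
        "arbitration_bus": f"AB{ROWS.index(row) + 1}",
    }


def arbitration_peers(row: str, column: str) -> list:
    """Other ASICs sharing the same arbitration bus (i.e., the same row)."""
    return [f"ASIC {row},{c}" for c in COLUMNS if c != column]


matrix = {(r, c): describe_asic(r, c) for r in ROWS for c in COLUMNS}
```

For example, `describe_asic("C", "B")` reports RAM region C, the B bus and arbitration bus AB3, and `arbitration_peers("A", "A")` lists ASICs A,B; A,C; and A,D.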

Each one of the bus arbitration sections 52 in each of the 
control logic sections ASICs A,A-A,D is identical in 
construction. An exemplary one thereof, here the bus arbitration 
section 52 in control logic section ASIC A,A (FIG. 3), 
includes a majority gate 56 fed by the control/data bus grants 
received from the other control logic sections ASICs 
A,B-A,D in the row thereof for producing an internal control/data 
bus grant when a majority of the control logic sections 
ASICs A,A-A,D indicate that the said one of the bus 
arbitration sections 52 has been granted the control/data bus. 
More particularly, the bus arbitration section 52 includes an 
internal arbitrator 58 responsive to control/data bus requests 
from the plurality of control logic sections ASICs A,A-A,D 
and provides a control/data bus grant (i.e., here logic 1) to 
one of the plurality of control logic sections ASICs A,A-A,D 
selectively in accordance with a pre-determined criteria. 
Here, the pre-determined criteria is a "first-come/first-serve" 
criteria. That is, the control logic sections ASICs A,A-A,D 
in the row thereof are granted the request in the order 
requested. It should be noted that each one of the bus 
arbitration sections 52 includes the internal arbitrator 58 
responsive to control/data bus requests from the plurality of 
control logic sections ASICs A,A-A,D. Each one of the 
plurality of control logic sections ASICs A,A-A,D provides 
a control/data bus grant to one of the plurality of control 
logic sections ASICs in the row thereof selectively in 
accordance with a common (i.e., the same, here "first-come/ 
first-serve") criteria. 
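A minimal sketch of a "first-come/first-serve" arbitrator of the kind described above follows. The class name, the queue-based bookkeeping and the `release` step are assumptions made for illustration; the text only states that requests are granted in the order requested and does not say how simultaneous requests are ordered.

```python
from collections import deque


class FirstComeFirstServeArbiter:
    """Grants the bus to requesters in arrival order (illustrative only)."""

    def __init__(self, sections):
        self.sections = list(sections)
        self.pending = deque()  # oldest outstanding request at the left

    def request(self, section):
        """Record a bus request from one control logic section."""
        if section not in self.sections:
            raise ValueError(f"unknown section: {section}")
        if section not in self.pending:
            self.pending.append(section)

    def release(self, section):
        """Drop the current owner once its transfer is finished."""
        if self.pending and self.pending[0] == section:
            self.pending.popleft()

    def grants(self):
        """Logic level on each grant output: 1 for the owner, else 0."""
        owner = self.pending[0] if self.pending else None
        return {s: int(s == owner) for s in self.sections}


arbiter = FirstComeFirstServeArbiter(["A,A", "A,B", "A,C", "A,D"])
arbiter.request("A,C")
arbiter.request("A,A")
first = arbiter.grants()   # A,C requested first, so it holds the grant
arbiter.release("A,C")
second = arbiter.grants()  # the grant then passes to A,A
```

Because every section in a row runs the same criteria on the same request lines, healthy arbitrators should reach the same decision, which is what allows their grant outputs to be compared by a majority gate.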

Thus, consider as a first example that the control section 
50 (FIG. 3) of control logic section ASIC A,A generates an 
internal request for the control/data bus on line 64. Such 
request is executed by distributing the internally generated 
logic 1 to all other control logic sections ASICs A,B-A,D, 
here via a logic 1 produced at ports Ro(A,B), Ro(A,C) and 
Ro(A,D) of ASIC A,A. It is first noted that the logic 1 
condition is represented in FIG. 5A by heavy weight lines. 
It is also noted that the logic 1 is fed to the internal arbitrator 
58 (FIG. 4). 

The logic 1 signals at ports Ro(A,B), Ro(A,C) and 
Ro(A,D) of control logic section ASIC A,A are fed to ports R(A,A) 
of the other control logic sections ASICs A,B; A,C; and A,D, 
as indicated by the heavy weight lines in FIG. 5A. Assuming 
that this is the first request, the bus arbitrator 58 in each of 
the control logic sections ASICs A,B-A,D should, if 
operating correctly, produce a logic 1 at port G(A,A) thereof and 
a logic 0 at ports G(A,D) and G(A,C) thereof. The logic 1 
signals produced at ports G(A,A) of the control logic sections 
ASICs A,B; A,C; A,D are fed to the ports GI(A,B), GI(A,C) 
and GI(A,D) of control logic section ASIC A,A. The logic 1 
at these ports is fed to the majority gate 56 (FIG. 4) of the 
arbitration section 52 in control logic section ASIC A,A. 
Because all three AND gates in the majority gate 56 section 
are logic 1, the output of the OR gate in the majority gate 56 
is a logic 1 thereby producing an internal control/data bus 
grant on line 70 to the requesting logic control section ASIC 
A,A. 
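The gate structure mentioned above (three AND gates feeding one OR gate) matches a two-out-of-three majority function over the three grant inputs. The sketch below assumes each AND gate takes one pair of the inputs GI(A,B), GI(A,C) and GI(A,D); that pairing is an inference from the text, not something the text spells out, and the function name is hypothetical.

```python
def majority_gate(gi_b: int, gi_c: int, gi_d: int) -> int:
    """Two-out-of-three majority built from three AND gates and one OR gate.

    Inputs are the logic levels (0 or 1) received at the three grant-in
    ports of one control logic section.
    """
    and_gates = (gi_b & gi_c, gi_b & gi_d, gi_c & gi_d)
    return and_gates[0] | and_gates[1] | and_gates[2]


# All three peers grant: every AND gate is 1, so the OR output is 1.
all_grant = majority_gate(1, 1, 1)
# Only one peer grants: no AND gate is 1, so no internal grant is produced.
single_grant = majority_gate(0, 1, 0)
```

With this structure a single missing or spurious grant signal cannot by itself change the outcome.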

The majority gate 56 arrangement provides the bus 
arbitration sections 52 with a degree of fault tolerance. For 
example, consider the example above, except that here the logic 
1 signal at port Ro(A,B) of control logic section ASIC A,A 
is not communicated to port R(A,A) of control logic section 
ASIC A,B (indicated by the broken line in FIG. 5B). Thus, 
control logic section ASIC A,B maintains a logic 0 at port 
G(A,A) thereof (i.e., a logic 1 is not transmitted to port 
GI(A,B) of control logic section ASIC A,A) as indicated by 
the dotted line in FIG. 5A. However, control logic sections 
ASICs A,C and A,D respond properly and produce logic 1 
signals at ports G(A,A) thereof. These two logic 1 signals 
are fed to ports GI(A,C) and GI(A,D) of control logic section 
ASIC A,A. Again the majority gate in control logic section 
ASIC A,A produces a logic 1 even though there is a fault 
with the control logic section ASIC A,A/control logic 
section ASIC A,B dialog. 

Considering the first example above, except that here 
logic control section ASIC A,D fails to produce a logic 1 at 
port G(A,A) as indicated by the broken line in FIG. 5C. 
Thus, the logic 1 signal at port G(A,A) of control logic 
section ASIC A,D is not communicated to port GI(A,D) of 
control logic section ASIC A,A. However, control logic 
sections ASICs A,B and A,C respond properly and produce 
logic 1 signals at ports G(A,A) thereof. These two logic 1 
signals are fed to ports GI(A,C) and GI(A,B) of control 
logic section ASIC A,A. Again, the majority gate in control 
logic section ASIC A,A produces a logic 1 even though there 
is a fault with the control logic section ASIC A,A/control 
logic section ASIC A,D dialog. 

Considering the first example above, except that here 
logic control section ASIC A,B produces a logic 1 control/ 
data bus grant to two control logic sections; one correctly to 
control logic section ASIC A,A via the G(A,A) port thereof 
and one incorrectly by placing a logic 1 at port G(A,D) of 
control logic section ASIC A,B. It is noted that the majority 
gate of control logic section ASIC A,A will properly produce 
an internal control/data bus grant signal and the majority 
gate of the control logic section ASIC A,D will not produce a 
control/data bus grant because none of the three AND gates 
in the majority gate of control logic section ASIC A,D will 
produce a logic 1. 
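The fault cases walked through above can be replayed with a small simulation of one row of control logic sections. This is a sketch under stated assumptions: the table of grant outputs, the scenario names and the pairwise-AND majority function are all invented for illustration, and each section is assumed to vote only on the grants it receives from the other three sections.

```python
SECTIONS = ("A,A", "A,B", "A,C", "A,D")


def majority(a: int, b: int, c: int) -> int:
    """Two-out-of-three majority (three AND gates feeding an OR gate)."""
    return (a & b) | (a & c) | (b & c)


def healthy_outputs(requester: str) -> dict:
    """outputs[src][dst]: level that src's arbitrator drives toward dst."""
    return {src: {dst: int(dst == requester) for dst in SECTIONS}
            for src in SECTIONS}


def internal_grants(outputs: dict) -> dict:
    """Each section votes on the grants received from its three peers."""
    grants = {}
    for dst in SECTIONS:
        received = [outputs[src][dst] for src in SECTIONS if src != dst]
        grants[dst] = majority(*received)
    return grants


def scenario(name: str) -> dict:
    """Internal grant per section when ASIC A,A is the only requester."""
    outputs = healthy_outputs("A,A")
    if name == "lost_request_to_A,B":
        # ASIC A,B never sees the request, so it grants nobody.
        outputs["A,B"] = {dst: 0 for dst in SECTIONS}
    elif name == "A,D_fails_to_grant":
        outputs["A,D"]["A,A"] = 0
    elif name == "A,B_spurious_grant_to_A,D":
        outputs["A,B"]["A,D"] = 1
    elif name != "healthy":
        raise ValueError(name)
    return internal_grants(outputs)
```

In each of the four scenarios only ASIC A,A ends up with an internal grant, which is the behaviour the examples describe for a single fault.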

Other embodiments are within the spirit and scope of the 
appended claims. 

What is claimed is: 

1. A data storage system, comprising: 

(A) a plurality of control/data buses; 

(B) a memory section coupled to the plurality of control/ 
data buses, such memory section, comprising: 

(i) a common memory region; 

(ii) a plurality of control logic sections interconnected 
through an arbitration bus, each one being coupled 
between a corresponding one of the control/data 
buses and the common memory region, each one of 
such control logic sections comprising: 

(a) a control logic for controlling transfer of data 
between the common memory region and the one 
of the plurality of control/data buses coupled to 
said one of the logic sections, such control logic 
producing a control/data bus request for the one of 
the control/data buses coupled thereto and for 
effecting the transfer in response to a control/data 
bus grant fed to the control logic; and 

(b) a bus arbitration section coupled to the arbitration 
bus for: 

(1) receiving a control/data bus request from the 
control logic in such one of the control logic 
sections and from the other control logic sec- 
tions coupled to such arbitration bus; 

(2) for granting access to the control/data bus to 
one of the control logic sections in accordance 
with the bus requests coupled to the bus 
arbitration section; 

(3) for receiving control/data bus grants from the 
other control logic sections coupled to such 
arbitration bus; and 

(4) for distributing the control/data bus request 
produced by the control logic in said control 
logic section to the other control logic sections 
coupled to the arbitration bus. 

2. The data storage system recited in claim 1 wherein each 
one of the bus arbitration sections includes a majority gate 
fed by the bus grants received from the other control logic 
sections for producing an internal bus grant when a majority 
of the control logic sections indicate that the said one of the 
bus arbitration sections has been granted the control/data 
bus. 

3. The data storage system recited in claim 1 wherein each 
one of the bus arbitration sections includes an internal 
arbitrator responsive to bus requests from the plurality of 
control logic sections and provides a bus grant to one of the 
plurality of control logic sections selectively in accordance 
with a pre-determined criteria. 

4. The data storage system recited in claim 1 wherein each 
one of the bus arbitration sections includes an internal 
arbitrator responsive to bus requests from the plurality of 
control logic sections and wherein each one of the plurality 
of control logic sections provides a bus grant to one of the 
plurality of control logic sections selectively in accordance 
with a common pre-determined criteria. 

5. The data storage system recited in claim 3 wherein each 
one of the bus arbitration sections includes a majority gate 
fed by the bus grants received from the other control logic 
sections for producing an internal bus grant when a majority 
of the control logic sections indicate that the said one of the 
bus arbitration sections has been granted the control/data 
bus. 

6. The data storage system recited in claim 4 wherein each 
one of the bus arbitration sections includes a majority gate 
fed by the bus grants received from the other control logic 
sections for producing an internal bus grant when a majority 
of the control logic sections indicate that the said one of the 
bus arbitration sections has been granted the control/data 
bus. 

7. A data storage system wherein a main frame computer 
section having main frame processors for processing data is 
coupled to a bank of disk drives through an interface, such 
interface comprising: 

(A) a plurality of control/data buses; 

(B) a plurality of controllers each one thereof asserting on 
the control/data bus during a controller initiated bus 
assert interval: (a) a memory address; and (b) a 
command, such command including: (i) either a write 
operation request or a read operation request; and, (ii) 
when a write operation is requested during a 
subsequent bus grant interval, data and bus write clock 
pulses; 

(C) at least one addressable memory section coupled to 
the plurality of control/data buses, such memory 
section, comprising: 

(i) a common memory region; 

(ii) a plurality of control logic sections interconnected 
through an arbitration bus, each one being coupled 
between a corresponding one of the control/data 
buses and the common memory region, each one of 
such control logic sections comprising: 

(a) a control logic for controlling transfer of data 
between the common memory region and the one 
of the plurality of control/data buses coupled to 
said one of the logic sections, such control logic 
for producing a bus request for the one of the 
control/data buses coupled thereto and being 
adapted to effect the transfer in response to a bus 
grant fed to the control logic; and 

(b) a bus arbitration section coupled to the arbitration 
bus for: 

(1) receiving a bus request from the control logic 
in such one of the control logic sections and 
from the other control logic sections coupled to 
such arbitration bus; 

(2) granting access to the control/data bus to one 
of the control logic sections in accordance with 
the bus requests coupled to the bus arbitration 
section; 

(3) receiving bus grants from the other control 
logic sections coupled to such arbitration bus; 
and 

(4) distributing the bus request produced by the 
control logic in said control logic section to the 
other control logic sections coupled to the 
arbitration bus. 

8. The data storage system recited in claim 7 wherein each 
one of the bus arbitration sections includes a majority gate 
fed by the bus grants received from the other control logic 
sections for producing an internal bus grant when a majority 
of the control logic sections indicate that the said one of the 
bus arbitration sections has been granted the control/data 
bus. 

9. The data storage system recited in claim 7 wherein each 
one of the bus arbitration sections includes an internal 
arbitrator responsive to bus requests from the plurality of 
control logic sections and provides a bus grant to one of the 
plurality of control logic sections selectively in accordance 
with a pre-determined criteria. 

10. The data storage system recited in claim 7 wherein 
each one of the bus arbitration sections includes an internal 
arbitrator responsive to bus requests from the plurality of 
control logic sections and wherein each one of the plurality 
of control logic sections provides a bus grant to one of the 
plurality of control logic sections selectively in accordance 
with a common pre-determined criteria. 

11. The data storage system recited in claim 9 wherein 
each one of the bus arbitration sections includes a majority 
gate fed by the bus grants received from the other control 
logic sections for producing an internal bus grant when a 
majority of the control logic sections indicate that the said 
one of the bus arbitration sections has been granted the 
control/data bus. 

12. The data storage system recited in claim 10 wherein 
each one of the bus arbitration sections includes a majority 
gate fed by the bus grants received from the other control 
logic sections for producing an internal bus grant when a 
majority of the control logic sections indicate that the said 
one of the bus arbitration sections has been granted the 
control/data bus. 

* * * * * 

(57) ABSTRACT 

A system interface includes a plurality of first directors, a 
plurality of second directors, a data transfer section and a 
message network. The data transfer section includes a cache 
memory. The cache memory is coupled to the plurality of 
first and second directors. The messaging network operates 
independently of the data transfer section and such network 
is coupled to the plurality of first directors and the plurality 
of second directors. The first and second directors control 
data transfer between the first directors and the second 
directors in response to messages passing between the first 
directors and the second directors through the messaging 
network to facilitate data transfer between first directors and 
the second directors. The data passes through the cache 
memory in the data transfer section. A method for operating 
a data storage system adapted to transfer data between a host 
computer/server and a bank of disk drives. The method 
includes transferring messages through a messaging net- 
work with the data being transferred between the host 
computer/server and the bank of disk drives through a cache 
memory, such message network being independent of the 
cache memory. 

12 Claims, 25 Drawing Sheets 




12/11/2003, EAST Version: 1.4.1 



US 6,651,130 B1 






U.S. Patent Nov. 18, 2003 Sheet 1 of 25 US 6,651,130 B1 



[FIGS. 3 and 4: drawing sheet. Legible labels: Director Board; Cache Memory Board; Message Network Boards 304 1 and 304 2; Backplane 302; To/From Disk Drives 140; To/From Host Computer/Server/Host Processor 120.] 






[FIG. 8: drawing sheet. Legible labels: Back End Director Boards (reference numeral 210); Message Network.] 








[FIG. 8C (FIGS. 8C-1 and 8C-2): drawing sheet. Legible labels: Switch Section 430; To/From Port 402A, Director 180; FIFO 432; Request Generator 434; Arbiter 436; signal names beginning RA1 and AK1.] 






[FIG. 11 (FIGS. 11A and 11B): message bus send operation. Legible flowchart text, FIG. 11A: 32 Byte Command field indicating bit vector Destination; 32 Byte Message Payload; Software Generates 64 Byte Descriptor; Software Increments Message Engine Xmit Write Pointer; State Machine 410 Initiates 64 Byte Descriptor Transfer Via DMA Xmit operation; 32 Byte Command & 32 Byte Message Payload into Xmit Data Buffer.] 

[FIG. 11B: Message Bus Send Operation Continued. Legible flowchart text: Message Engine encapsulates the Cell Payload with the MAC header and CRC inside the Packetizer; Message Engine transfers Packet to Switch Element 320; Switch Elements Routes Packet to destination node via 304 1, 304 2, or on-board Director.] 






[FIG. 12 (FIGS. 12A and 12B): message bus receive operation. Legible flowchart text: Message Engine de-encapsulates the message payload from the packet; Message Engine initiates 32 Byte payload transfer via DMA receive operation; High Priority Queue; CPU 310 Reads and processes message payload in the receive Queue and increments the receive read pointer.] 

[FIG. 13: message bus acknowledgement operation. Legible flowchart text: message payload transfer to Receive Queue; Acknowledgement packet and transmit to sending node; Switch Elements Routes Acknowledgement Packet to destination node via 304 1, 304 2 or on-board Director Node; High Priority Queue; Low Priority Queue; Engine Increments the send read pointer; CPU 310 processes Status and removes the descriptor from the Send Queue.] 



[FIG. 14A: transmit (Xmit) flow. Legible flowchart text: Message Engine loads Xmit DMA Address Register; DMA Engine loads Xmit DMA Counter with 32 byte count; Check for Valid Queue Address.] 

[FIG. 14B. Legible flowchart text: Is Burst Count = 8?; Is Xmit transfer Counter Expired?; Output EOT and Status to the Message Engine; DMA Complete.] 

[FIG. 15A: drawing sheet; flowchart text not recoverable.] 



US 6,651,130 B1 



DATA STORAGE SYSTEM HAVING 
SEPARATE DATA TRANSFER SECTION AND 
MESSAGE NETWORK WITH BUS 
ARBITRATION 



BACKGROUND OF THE INVENTION 

This invention relates generally to data storage systems, 
and more particularly to data storage systems having 
redundancy arrangements to protect against total system failure in 
the event of a failure in a component or subassembly of the 
storage system. 

As is known in the art, large host computers and servers 
(collectively referred to herein as "host computer/servers") 
require large capacity data storage systems. These large 
computer/servers generally include data processors, which 
perform many operations on data introduced to the host 
computer/server through peripherals including the data 
storage system. The results of these operations are output to 
peripherals, including the storage system. 

One type of data storage system is a magnetic disk storage
system. Here a bank of disk drives and the host
computer/server are coupled together through an interface.
The interface includes "front end" or host computer/server
controllers (or directors) and "back-end" or disk controllers
(or directors). The interface operates the controllers (or
directors) in such a way that they are transparent to the host
computer/server. That is, data is stored in, and retrieved
from, the bank of disk drives in such a way that the host
computer/server merely thinks it is operating with its own
local disk drive. One such system is described in U.S. Pat.
No. 5,206,939, entitled "System and Method for Disk Mapping
and Data Retrieval", inventors Moshe Yanai, Natan Vishlitzky,
Bruno Alterescu and Daniel Castel, issued Apr. 27, 1993, and
assigned to the same assignee as the present invention.

As described in such U.S. Patent, the interface may also
include, in addition to the host computer/server controllers
(or directors) and disk controllers (or directors), addressable
cache memories. The cache memory is a semiconductor
memory and is provided to rapidly store data from the host
computer/server before storage in the disk drives, and, on
the other hand, store data from the disk drives prior to being
sent to the host computer/server. The cache memory being a
semiconductor memory, as distinguished from a magnetic
memory as in the case of the disk drives, is much faster than
the disk drives in reading and writing data.

The host computer/server controllers, disk controllers and
cache memory are interconnected through a backplane
printed circuit board. More particularly, disk controllers are
mounted on disk controller printed circuit boards. The host
computer/server controllers are mounted on host
computer/server controller printed circuit boards. And, cache
memories are mounted on cache memory printed circuit boards.
The disk directors, host computer/server directors, and cache
memory printed circuit boards plug into the backplane
printed circuit board. In order to provide data integrity in
case of a failure in a director, the backplane printed circuit
board has a pair of buses. One set of the disk directors is
connected to one bus and another set of the disk directors is
connected to the other bus. Likewise, one set of the host
computer/server directors is connected to one bus and
another set of the host computer/server directors is
connected to the other bus. The cache memories are
connected to both buses. Each one of the buses provides data,
address and control information.



The arrangement is shown schematically in FIG. 1. Thus,
the use of two buses B1, B2 provides a degree of redundancy
to protect against a total system failure in the event that the
controllers or disk drives connected to one bus fail. Further,
the use of two buses increases the data transfer bandwidth of
the system compared to a system having a single bus. Thus,
in operation, when the host computer/server 12 wishes to
store data, the host computer 12 issues a write request to one
of the front-end directors 14 (i.e., host computer/server
directors) to perform a write command. One of the front-end
directors 14 replies to the request and asks the host computer
12 for the data. After the request has passed to the requesting
one of the front-end directors 14, the director 14 determines
the size of the data and reserves space in the cache memory
18 to store the request. The front-end director 14 then
produces control signals on one of the address memory
buses B1, B2 connected to such front-end director 14 to
enable the transfer to the cache memory 18. The host
computer/server 12 then transfers the data to the front-end
director 14. The front-end director 14 then advises the host
computer/server 12 that the transfer is complete. The
front-end director 14 looks up in a Table, not shown, stored in
the cache memory 18 to determine which one of the back-end
directors 20 (i.e., disk directors) is to handle this request.
The Table maps the host computer/server 12 addresses into
an address in the bank 22 of disk drives. The front-end
director 14 then puts a notification in a "mail box" (not
shown and stored in the cache memory 18) for the back-end
director 20 which is to handle the request, giving the amount
of the data and the disk address for the data. The back-end
directors 20 poll the cache memory 18 when they are idle to
check their "mail boxes". If the polled "mail box" indicates
a transfer is to be made, the back-end director 20 processes
the request, addresses the disk drive in the bank 22, reads the
data from the cache memory 18 and writes it into the
addresses of a disk drive in the bank 22.

When data is to be read from a disk drive in bank 22 to
the host computer/server 12, the system operates in a
reciprocal manner. More particularly, during a read operation,
a read request is instituted by the host computer/server 12 for
data at specified memory locations (i.e., a requested data
block). One of the front-end directors 14 receives the read
request and examines the cache memory 18 to determine
whether the requested data block is stored in the cache
memory 18. If the requested data block is in the cache
memory 18, the requested data block is read from the cache
memory 18 and is sent to the host computer/server 12. If the
front-end director 14 determines that the requested data
block is not in the cache memory 18 (i.e., a so-called "cache
miss"), the director 14 writes a note in the cache memory
18 (i.e., the "mail box") that it needs to receive the requested
data block. The back-end directors 20 poll the cache
memory 18 to determine whether there is an action to be
taken (i.e., a read operation of the requested block of data).
The one of the back-end directors 20 which polls the cache
memory 18 mail box and detects a read operation reads the
requested data block and initiates storage of such requested
data block in the cache memory 18. When the data block
is completely written into the cache memory 18, a read
complete indication is placed in the "mail box" in the cache
memory 18. It is to be noted that the front-end directors 14
are polling the cache memory 18 for read complete
indications. When one of the polling front-end directors 14
detects a read complete indication, such front-end director 14
completes the transfer of the requested data which is now
stored in the cache memory 18 to the host computer/server 12.
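
The polling scheme described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `poll_mailboxes`, the dictionary used as the "mail box" area, and the director labels are all hypothetical. The sketch only shows that every poll, whether or not it finds work, costs an access to the cache memory that holds the mailboxes.

```python
def poll_mailboxes(mailbox, director_ids):
    """Let each idle director check its mailbox once (hypothetical model).

    mailbox maps a director id to a pending request; it stands in for the
    "mail box" area kept in cache memory 18. Returns the claimed requests
    and the number of cache accesses spent on polling.
    """
    cache_accesses = 0
    claimed = {}
    for director in director_ids:
        cache_accesses += 1                    # every poll reads the cache memory
        request = mailbox.pop(director, None)  # None when the mailbox is empty
        if request is not None:
            claimed[director] = request
    return claimed, cache_accesses


mailbox = {"BE2": "read block 7"}
claimed, accesses = poll_mailboxes(mailbox, ["BE1", "BE2", "BE3"])
```

In this run three cache accesses are spent to deliver a single request, which is the overhead the following paragraph refers to.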

The use of mailboxes and polling requires time to transfer
data between the host computer/server 12 and the bank 22 of
disk drives, thus reducing the operating bandwidth of the
interface.

SUMMARY OF THE INVENTION

In accordance with the present invention, a system
interface is provided. Such interface includes a plurality of
first directors, a plurality of second directors, a data transfer
section and a message network. The data transfer section
includes a cache memory. The cache memory is coupled to
the plurality of first and second directors. The messaging
network operates independently of the data transfer section
and such network is coupled to the plurality of first directors
and the plurality of second directors. The first and second
directors control data transfer between the first directors and
the second directors in response to messages passing between
the first directors and the second directors through the
messaging network to facilitate data transfer between first
directors and the second directors. The data passes through
the cache memory in the data transfer section.

With such an arrangement, the cache memory in the data
transfer section is not burdened with the task of transferring
the director messaging but rather a messaging network is
provided, operative independent of the data transfer section,
for such messaging thereby increasing the operating
bandwidth of the system interface.

In one embodiment of the invention, each one of the first
directors of the system interface includes a data pipe coupled
between an input of such one of the first directors and the
cache memory, and a controller for transferring the messages
between the message network and such one of the first
directors.

In one embodiment each one of the second directors 
includes a data pipe coupled between an input of such one 
of the second directors and the cache memory and a con- 
troller for transferring the messages between the message 
network and such one of the second directors. 

In one embodiment the directors include: a data pipe
coupled between an input of such one of the first directors
and the cache memory; a microprocessor; and a controller
coupled to the microprocessor and the data pipe for
controlling the transfer of the messages between the message
network and such one of the first directors and for
controlling the data between the input of such one of the first
directors and the cache memory.

In accordance with another feature of the invention, a data
storage system is provided for transferring data between a
host computer/server and a bank of disk drives through a
system interface. The system interface includes a plurality of
first directors coupled to the host computer/server, a plurality
of second directors coupled to the bank of disk drives, a data
transfer section, and a message network. The data transfer
section includes a cache memory. The cache memory is
coupled to the plurality of first and second directors. The
message network is operative independently of the data
transfer section and such network is coupled to the plurality
of first directors and the plurality of second directors. The
first and second directors control data transfer between the
host computer and the bank of disk drives in response to
messages passing between the first directors and the second
directors through the messaging network to facilitate the
data transfer between the host computer/server and the bank
of disk drives with such data passing through the cache
memory in the data transfer section.

In accordance with yet another embodiment, a method is
provided for operating a data storage system adapted to
transfer data between a host computer/server and a bank of
disk drives. The method includes transferring messages
through a messaging network with the data being transferred
between the host computer/server and the bank of disk
drives through a cache memory, such message network
being independent of the cache memory.

In accordance with another embodiment, a method is
provided for operating a data storage system adapted to
transfer data between a host computer/server and a bank of
disk drives through a system interface. The interface
includes a plurality of first directors coupled to the host
computer/server, a plurality of second directors coupled to
the bank of disk drives; and a data transfer section having a
cache memory, such cache memory being coupled to the
plurality of first and second directors. The method comprises
transferring the data between the host computer/server and
the bank of disk drives under control of the first and second
directors in response to messages passing between the first
directors and the second directors through a messaging
network to facilitate the data transfer between the host
computer/server and the bank of disk drives with such data
passing through the cache memory in the data transfer
section, such message network being independent of the
cache memory.

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other features of the invention will become 
more readily apparent from the following detailed descrip- 
tion when read together with the accompanying drawings, in 
which: 

FIG. 1 is a block diagram of a data storage system 
according to the PRIOR ART; 

FIG. 2 is a block diagram of a data storage system 
according to the invention; 

FIG. 2A shows the fields of a descriptor used in the system
interface of the data storage system of FIG. 2;

FIG. 2B shows the fields used in a MAC packet used in the
system interface of the data storage system of FIG. 2;

FIG. 3 is a sketch of an electrical cabinet storing a system 
interface used in the data storage system of FIG. 2; 

FIG. 4 is a diagrammatical, isometric sketch showing
printed circuit boards providing the system interface of the
data storage system of FIG. 2;

FIG. 5 is a block diagram of the system interface used in 
the data storage system of FIG. 2; 

FIG. 6 is a block diagram showing the connections 
between front-end and back-end directors to one of a pair of 
message network boards used in the system interface of the 
data storage system of FIG. 2; 

FIG. 7 is a block diagram of an exemplary one of the
director boards used in the system interface of the data
storage system of FIG. 2;

FIG. 8 is a block diagram of the system interface used in 
the data storage system of FIG. 2; 

FIG. 8A is a diagram of an exemplary global cache
memory board used in the system interface of FIG. 8;

FIG. 8B is a diagram showing a pair of director boards 
coupled between a pair of host processors and global cache 
memory boards used in the system interface of FIG. 8; 

FIG. 8C is a block diagram of an exemplary crossbar
switch used in the front-end and rear-end directors of the
system interface of FIG. 8;

FIG. 9 is a block diagram of a transmit Direct Memory
Access (DMA) used in the system interface of FIG. 8;

FIG. 10 is a block diagram of a receive DMA used in the 
system interface of FIG. 8; 

FIG. 11 shows the relationship between FIGS. 11A and
11B, such FIGS. 11A and 11B together showing a process
flow diagram of the send operation of a message network
used in the system interface of FIG. 8;

FIGS. 11C-11E are examples of digital words used by the
message network in the system interface of FIG. 8;

FIG. 11F shows bits in a mask used in such message
network;

FIG. 11G shows the result of the mask of FIG. 11F applied
to the digital word shown in FIG. 11E;

FIG. 12 shows the relationship between FIGS. 12A and
12B, such FIGS. 12A and 12B showing a process flow
diagram of the receive operation of a message network used
in the system interface of FIG. 8;

FIG. 13 shows the relationship between FIGS. 11A and
11B, such FIGS. 11A and 11B together showing a process
flow diagram of the acknowledgement operation of a
message network used in the system interface of FIG. 8;

FIGS. 14A and 14B show process flow diagrams of the
transmit DMA operation of the transmit DMA of FIG. 9; and

FIGS. 15A and 15B show process flow diagrams of the
receive DMA operation of the receive DMA of FIG. 10.

DETAILED DESCRIPTION

Referring now to FIG. 2, a data storage system 100 is
shown for transferring data between a host computer/server
120 and a bank of disk drives 140 through a system interface
160. The system interface 160 includes: a plurality of, here
32, front-end directors 180₁-180₃₂ coupled to the host
computer/server 120 via ports 123₁-123₃₂; a plurality of
back-end directors 200₁-200₃₂ coupled to the bank of disk
drives 140 via ports 123₃₃-123₆₄; a data transfer section 240,
having a global cache memory 220, coupled to the plurality of
front-end directors 180₁-180₁₆ and the back-end directors
200₁-200₁₆; and a messaging network 260, operative
independently of the data transfer section 240, coupled to the
plurality of front-end directors 180₁-180₃₂ and the plurality
of back-end directors 200₁-200₃₂, as shown. The front-end
and back-end directors 180₁-180₃₂, 200₁-200₃₂ are
functionally similar and include a microprocessor (μP) 299
(i.e., a central processing unit (CPU) and RAM), a message
engine/CPU controller 314 and a data pipe 316 to be
described in detail in connection with FIGS. 5, 6 and 7.
Suffice it to say here, however, that the front-end and
back-end directors 180₁-180₃₂, 200₁-200₃₂ control data
transfer between the host computer/server 120 and the bank
of disk drives 140 in response to messages passing between
the directors 180₁-180₃₂, 200₁-200₃₂ through the
messaging network 260. The messages facilitate the data
transfer between host computer/server 120 and the bank of
disk drives 140 with such data passing through the global
cache memory 220 via the data transfer section 240. More
particularly, in the case of the front-end directors
180₁-180₃₂, the data passes between the host computer and
the global cache memory 220 through the data pipe 316 in
the front-end directors 180₁-180₃₂ and the messages pass
through the message engine/CPU controller 314 in such
front-end directors 180₁-180₃₂. In the case of the back-end
directors 200₁-200₃₂, the data passes between the back-end
directors 200₁-200₃₂ and the bank of disk drives 140 and the
global cache memory 220 through the data pipe 316 in the
back-end directors 200₁-200₃₂ and again the messages pass
through the message engine/CPU controller 314 in such
back-end director 200₁-200₃₂.

With such an arrangement, the cache memory 220 in the
data transfer section 240 is not burdened with the task of
transferring the director messaging. Rather the messaging
network 260 operates independent of the data transfer
section 240 thereby increasing the operating bandwidth of the
system interface 160.

In operation, and considering first a read request by the
host computer/server 120 (i.e., the host computer/server 120
requests data from the bank of disk drives 140), the request
is passed from one of a plurality of, here 32, host computer
processors 121₁-121₃₂ in the host computer 120 to one or
more of the pair of the front-end directors 180₁-180₃₂
connected to such host computer processor 121₁-121₃₂. (It
is noted that in the host computer 120, each one of the host
computer processors 121₁-121₃₂ is coupled to here a pair
(but not limited to a pair) of the front-end directors
180₁-180₃₂ to provide redundancy in the event of a failure
in one of the front-end directors 180₁-180₃₂ coupled thereto.
Likewise, the bank of disk drives 140 has a plurality of, here
32, disk drives 141₁-141₃₂, each disk drive 141₁-141₃₂
being coupled to here a pair (but not limited to a pair) of the
back-end directors 200₁-200₃₂, to provide redundancy in the
event of a failure in one of the back-end directors
200₁-200₃₂ coupled thereto). Each front-end director
180₁-180₃₂ includes a microprocessor (μP) 299 (i.e., a
central processing unit (CPU) and RAM) and will be
described in detail in connection with FIGS. 5 and 7. Suffice
it to say here, however, that the microprocessor 299 makes
a request for the data from the global cache memory 220.
The global cache memory 220 has a resident cache
management table, not shown. Every director 180₁-180₃₂,
200₁-200₃₂ has access to the resident cache management
table and every time a front-end director 180₁-180₃₂
requests a data transfer, the front-end director 180₁-180₃₂
must query the global cache memory 220 to determine
whether the requested data is in the global cache memory
220. If the requested data is in the global cache memory 220
(i.e., a read "hit"), the front-end director 180₁-180₃₂, more
particularly the microprocessor 299 therein, mediates a
DMA (Direct Memory Access) operation for the global
cache memory 220 and the requested data is transferred to
the requesting host computer processor 121₁-121₃₂.

If, on the other hand, the front-end director 180₁-180₃₂
receiving the data request determines that the requested data
is not in the global cache memory 220 (i.e., a "miss") as a
result of a query of the cache management table in the global
cache memory 220, such front-end director 180₁-180₃₂
concludes that the requested data is in the bank of disk drives
140. Thus the front-end director 180₁-180₃₂ that received
the request for the data must make a request for the data from
one of the back-end directors 200₁-200₃₂ in order for such
back-end director 200₁-200₃₂ to request the data from the
bank of disk drives 140. The mapping of which back-end
directors 200₁-200₃₂ control which disk drives 141₁-141₃₂
in the bank of disk drives 140 is determined during a
power-up initialization phase. The map is stored in the
global cache memory 220. Thus, when the front-end director
180₁-180₃₂ makes a request for data from the global cache
memory 220 and determines that the requested data is not in
the global cache memory 220 (i.e., a "miss"), the front-end
director 180₁-180₃₂ is also advised by the map in the global
cache memory 220 of the back-end director 200₁-200₃₂
responsible for the requested data in the bank of disk drives
140. The requesting front-end director 180₁-180₃₂ then
must make a request for the data in the bank of disk drives
140 from the map designated back-end director 200₁-200₃₂.
This request between the front-end director 180₁-180₃₂ and
the appropriate one of the back-end directors 200₁-200₃₂ (as
determined by the map stored in the global cache memory
220) is by a message which passes from the front-end
director 180₁-180₃₂ through the message network 260 to the
appropriate back-end director 200₁-200₃₂. It is noted then
that the message does not pass through the global cache
memory 220 (i.e., does not pass through the data transfer
section 240) but rather passes through the separate,
independent message network 260. Thus, communication
between the directors 180₁-180₃₂, 200₁-200₃₂ is through
the message network 260 and not through the global cache
memory 220. Consequently, valuable bandwidth for the
global cache memory 220 is not used for messaging among
the directors 180₁-180₃₂, 200₁-200₃₂.

Thus, on a global cache memory 220 "read miss", the
front-end director 180₁-180₃₂ sends a message to the
appropriate one of the back-end directors 200₁-200₃₂ through
the message network 260 to instruct such back-end director
200₁-200₃₂ to transfer the requested data from the bank of
disk drives 140 to the global cache memory 220. When
accomplished, the back-end director 200₁-200₃₂ advises the
requesting front-end director 180₁-180₃₂ that the transfer is
accomplished by a message, which passes from the back-end
director 200₁-200₃₂ to the front-end director
180₁-180₃₂ through the message network 260. In response
to the acknowledgement signal, the front-end director
180₁-180₃₂ is thereby advised that such front-end director
180₁-180₃₂ can transfer the data from the global cache
memory 220 to the requesting host computer processor
121₁-121₃₂ as described above when there is a cache "read
hit".
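
The read "hit" and "read miss" paths described above can be sketched as follows. This is a minimal illustrative model in Python, not the patented implementation: the class names `GlobalCache` and `MessageNetwork`, the function names, and the string director ids are assumptions made for the example. The point it shows is that request and acknowledgement messages travel only over the message network, while the data itself moves only through the cache.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalCache:
    """Stand-in for global cache memory 220 (hypothetical model)."""
    slots: dict = field(default_factory=dict)         # block address -> data
    director_map: dict = field(default_factory=dict)  # block address -> back-end director id


@dataclass
class MessageNetwork:
    """Stand-in for message network 260; it carries messages, never data."""
    log: list = field(default_factory=list)

    def send(self, src, dst, kind, block):
        self.log.append((src, dst, kind, block))


def back_end_stage(cache, disks, block):
    """Back-end director copies the block from the disk bank into the cache."""
    cache.slots[block] = disks[block]


def front_end_read(fe_id, block, cache, network, disks):
    """Serve a host read: directly on a hit, via a back-end message on a miss."""
    if block not in cache.slots:                   # read "miss"
        be_id = cache.director_map[block]          # map stored in the cache
        network.send(fe_id, be_id, "read_request", block)
        back_end_stage(cache, disks, block)        # disk -> global cache
        network.send(be_id, fe_id, "ack", block)   # transfer accomplished
    return cache.slots[block]                      # data path: cache -> host


cache = GlobalCache(director_map={7: "BE3"})
network = MessageNetwork()
disks = {7: b"payload"}
first = front_end_read("FE1", 7, cache, network, disks)   # miss, two messages
second = front_end_read("FE1", 7, cache, network, disks)  # hit, no messages
```

The second call finds the block already staged in the cache, so the message log does not grow.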

It should be noted that there might be one or more
back-end directors 200₁-200₃₂ responsible for the requested
data. Thus, if only one back-end director 200₁-200₃₂ is
responsible for the requested data, the requesting front-end
director 180₁-180₃₂ sends a uni-cast message via the
message network 260 to only that specific one of the back-end
directors 200₁-200₃₂. On the other hand, if more than one of
the back-end directors 200₁-200₃₂ is responsible for the
requested data, a multi-cast message (here implemented as
a series of uni-cast messages) is sent by the requesting one
of the front-end directors 180₁-180₃₂ to all of the back-end
directors 200₁-200₃₂ having responsibility for the requested
data. In any event, with both a uni-cast or multi-cast
message, such message is passed through the message
network 260 and not through the data transfer section 240
(i.e., not through the global cache memory 220).
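
A multi-cast implemented as a series of uni-cast messages, as described above, might be sketched like this. The function names and the list used as a message log are hypothetical and serve only to illustrate the idea.

```python
def send_unicast(network_log, src, dst, payload):
    """Record one uni-cast message on a hypothetical message network log."""
    network_log.append((src, dst, payload))


def send_message(network_log, src, destinations, payload):
    """Send to one director (uni-cast) or to several (repeated uni-casts)."""
    for dst in destinations:
        send_unicast(network_log, src, dst, payload)
    return len(destinations)


log = []
sent = send_message(log, "FE1", ["BE2", "BE5"], "read block 7")
```

A single-element destination list degenerates to the plain uni-cast case, so one code path covers both situations.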

Likewise, it should be noted that while one of the host
computer processors 121₁-121₃₂ might request data, the
acknowledgement signal may be sent to the requesting host
computer processor 121₁-121₃₂ or one or more other host
computer processors 121₁-121₃₂ via a multi-cast (i.e.,
sequence of uni-cast) messages through the message network
260 to complete the data read operation.

Considering a write operation, the host computer 120
wishes to write data into storage (i.e., into the bank of disk
drives 140). One of the front-end directors 180₁-180₃₂
receives the data from the host computer 120 and writes it
into the global cache memory 220. The front-end director
180₁-180₃₂ then requests the transfer of such data after
some period of time when the back-end director 200₁-200₃₂
determines that the data can be removed from such cache
memory 220 and stored in the bank of disk drives 140.
Before the transfer to the bank of disk drives 140, the data
in the cache memory 220 is tagged with a bit as "fresh data"
(i.e., data which has not been transferred to the bank of disk
drives 140, that is data which is "write pending"). Thus, if
there are multiple write requests for the same memory
location in the global cache memory 220 (e.g., a particular
bank account) before being transferred to the bank of disk
drives 140, the data is overwritten in the cache memory 220
with the most recent data. Each time data is transferred to the
global cache memory 220, the front-end director 180₁-180₃₂
controlling the transfer also informs the host computer 120
that the transfer is complete to thereby free-up the host
computer 120 for other data transfers.

When it is time to transfer the data in the global cache
memory 220 to the bank of disk drives 140, as determined
by the back-end director 200₁-200₃₂, the back-end director
200₁-200₃₂ transfers the data from the global cache memory
220 to the bank of disk drives 140 and resets the tag
associated with data in the global cache memory 220 (i.e.,
un-tags the data) to indicate that the data in the global cache
memory 220 has been transferred to the bank of disk drives
140. It is noted that the un-tagged data in the global cache
memory 220 remains there until overwritten with new data.
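
The "write pending" tagging described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `CacheSlot` class, the function names, and the use of plain dictionaries for the cache and the disk bank are hypothetical and are not drawn from the patent.

```python
class CacheSlot:
    """One location in a hypothetical model of global cache memory 220."""

    def __init__(self):
        self.data = None
        self.write_pending = False   # the "fresh data" tag bit


def host_write(cache, address, data):
    """Front-end side: write into the cache, tag it, report completion."""
    slot = cache.setdefault(address, CacheSlot())
    slot.data = data                 # later writes overwrite earlier ones
    slot.write_pending = True
    return "transfer complete"       # host is freed before any disk write


def destage(cache, disks):
    """Back-end side: copy tagged slots to disk and reset their tags."""
    written = []
    for address, slot in cache.items():
        if slot.write_pending:
            disks[address] = slot.data
            slot.write_pending = False   # un-tag; data stays in the cache
            written.append(address)
    return written


cache, disks = {}, {}
host_write(cache, "acct-1", 100)
host_write(cache, "acct-1", 250)     # same location, overwritten in cache
first_pass = destage(cache, disks)
second_pass = destage(cache, disks)  # nothing is pending any more
```

Only the most recent value reaches the disk bank, and the un-tagged copy remains readable from the cache afterwards.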

Referring now to FIGS. 3 and 4, the system interface 160
is shown to include an electrical cabinet 300 having stored
therein: a plurality of, here eight, front-end director boards
190₁-190₈, each one having here four of the front-end
directors 180₁-180₃₂; a plurality of, here eight, back-end
director boards 210₁-210₈, each one having here four of the
back-end directors 200₁-200₃₂; and a plurality of, here
eight, memory boards 220' which together make up the
global cache memory 220. These boards plug into the front
side of a backplane 302. (It is noted that the backplane 302
is a mid-plane printed circuit board). Plugged into the
backside of the backplane 302 are message network boards
304₁, 304₂. The backside of the backplane 302 has plugged
into it adapter boards, not shown in FIGS. 2-4, which couple
the boards plugged into the back-side of the backplane 302
with the computer 120 and the bank of disk drives 140 as
shown in FIG. 2. That is, referring again briefly to FIG. 2,
an I/O adapter, not shown, is coupled between each one of
the front-end directors 180₁-180₃₂ and the host computer
120 and an I/O adapter, not shown, is coupled between each
one of the back-end directors 200₁-200₃₂ and the bank of
disk drives 140.

Referring now to FIG. 5, the system interface 160 is
shown to include the director boards 190₁-190₈, 210₁-210₈
and the global cache memory 220, plugged into the
backplane 302 and the disk drives 141₁-141₃₂ in the bank of
disk drives along with the host computer 120 also plugged into
the backplane 302 via I/O adapter boards, not shown. The
message network 260 (FIG. 2) includes the message
network boards 304₁ and 304₂. Each one of the message
network boards 304₁ and 304₂ is identical in construction. A
pair of message network boards 304₁ and 304₂ is used for
redundancy and for message load balancing. Thus, each
message network board 304₁, 304₂ includes a controller 306
(i.e., an initialization and diagnostic processor comprising a
CPU, system controller interface and memory, as shown in
FIG. 6 for one of the message network boards 304₁, 304₂,
here board 304₁) and a crossbar switch section 308 (e.g., a
switching fabric made up of here four switches 308₁-308₄).

Referring again to FIG. 5, each one of the director boards
190₁-210₈ includes, as noted above, four of the directors
180₁-180₃₂, 200₁-200₃₂ (FIG. 2). It is noted that the director
boards 190₁-190₈ having four front-end directors per board,
180₁-180₃₂, are referred to as front-end directors and the
director boards 210₁-210₈ having four back-end directors
per board, 200₁-200₃₂, are referred to as back-end directors.
Each one of the directors 180₁-180₃₂, 200₁-200₃₂ includes
a CPU 310, a RAM 312 (which make up the microprocessor
299 referred to above), the message engine/CPU controller
314, and the data pipe 316.

Each one of the director boards 190₁-210₈ includes a
crossbar switch 318. The crossbar switch 318 has four
input/output ports 319, each one being coupled to the data
pipe 316 of a corresponding one of the four directors
180₁-180₃₂, 200₁-200₃₂ on the director board 190₁-210₈.
The crossbar switch 318 has eight output/input ports
collectively identified in FIG. 5 by numerical designation 321
(which plug into the backplane 302). The crossbar switch 318
on the front-end director boards 190₁-190₈ is used for
coupling the data pipe 316 of a selected one of the four
front-end directors 180₁-180₃₂ on the front-end director
board 190₁-190₈ to the global cache memory 220 via the
backplane 302 and I/O adapter, not shown. The crossbar
switch 318 on the back-end director boards 210₁-210₈ is
used for coupling the data pipe 316 of a selected one of the
four back-end directors 200₁-200₃₂ on the back-end director
board 210₁-210₈ to the global cache memory 220 via the
backplane 302 and I/O adapter, not shown. Thus, referring
to FIG. 2, the data pipe 316 in the front-end directors
180₁-180₃₂ couples data between the host computer 120 and
the global cache memory 220 while the data pipe 316 in the
back-end directors 200₁-200₃₂ couples data between the
bank of disk drives 140 and the global cache memory 220.
It is noted that there are separate point-to-point data paths
P₁-P₆₄ (FIG. 2) between each one of the directors
180₁-180₃₂, 200₁-200₃₂ and the global cache memory 220.
It is also noted that the backplane 302 is a passive backplane
because it is made up of only etched conductors on one or
more layers of a printed circuit board. That is, the backplane
302 does not have any active components.

Referring again to FIG. 5, each one of the director boards
190₁-210₈ includes a crossbar switch 320. Each crossbar
switch 320 has four input/output ports 323, each one of the
four input/output ports 323 being coupled to the message
engine/CPU controller 314 of a corresponding one of the
four directors 180₁-180₃₂, 200₁-200₃₂ on the director board
190₁-210₈. Each crossbar switch 320 has a pair of
output/input ports 325₁, 325₂, which plug into the backplane
302. Each port 325₁-325₂ is coupled to a corresponding one of
the message network boards 304₁, 304₂, respectively,
through the backplane 302. The crossbar switch 320 on the
front-end director boards 190₁-190₈ is used to couple the
messages between the message engine/CPU controller 314
of a selected one of the four front-end directors 180₁-180₃₂
on the front-end director boards 190₁-190₈ and the message
network 260, FIG. 2. Likewise, the back-end director boards
210₁-210₈ are used to couple the messages produced by a
selected one of the four back-end directors 200₁-200₃₂ on
the back-end director board 210₁-210₈ between the message
engine/CPU controller 314 of a selected one of such four
back-end directors and the message network 260 (FIG. 2).
Thus, referring also to FIG. 2, instead of having a separate
dedicated message path between each one of the directors
180₁-180₃₂, 200₁-200₃₂ and the message network 260
(which would require M individual connections to the
backplane 302 for each of the directors, where M is an
integer), here only M/4 individual connections are required.
Thus, the total number of connections between the directors
180₁-180₃₂, 200₁-200₃₂ and the backplane 302 is reduced to
1/4th. Thus, it should be noted from FIGS. 2 and 5 that the
message network 260 (FIG. 2) includes the crossbar switch
320 and the message network boards 304₁, 304₂.

Each message is a 64-byte descriptor (shown in FIG. 2A)
which is created by the CPU 310 (FIG. 5) under software
control and is stored in a send queue in RAM 312. When the
message is to be read from the send queue in RAM 312 and
transmitted through the message network 260 (FIG. 2) to
one or more other directors via a DMA operation to be
described, it is packetized in the packetizer portion of
packetizer/de-packetizer 428 (FIG. 7) into a MAC type
packet, shown in FIG. 2B, here using the NGIO protocol
specification. There are three types of packets: a message
packet; an acknowledgement packet; and a message
network fabric management packet, the latter being used to
establish the message network routing during initialization
(i.e., during power-up). Each one of the MAC packets has:
an 8-byte header which includes source (i.e., transmitting
director) and destination (i.e., receiving director) addresses;
a payload; and a terminating 4-byte Cyclic Redundancy
Check (CRC), as shown in FIG. 2B. The acknowledgement
packet (i.e., signal) has a 4-byte acknowledgement payload
section. The message packet has a 32-byte payload section.
The Fabric Management Packet (FMP) has a 256-byte
payload section. The MAC packet is sent to the crossbar
switch 320. The destination portion of the packet is used to
indicate the destination for the message and is decoded by
the switch 320 to determine to which port the message is to
be routed. The decoding process uses a decoder table 327 in
the switch 318, such table being initialized during
power-up by the initialization and diagnostic processor
(controller) 306 (FIG. 5). The table 327 (FIG. 7) provides the
relationship between the destination address portion of the
MAC packet, which identifies the routing for the message,
and the one of the four directors 180_1-180_32, 200_1-200_32 on
the director board 190_1-190_8, 210_1-210_8 or the one of the
message network boards 304_1, 304_2 to which the message is
to be directed.
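
The packet framing described above can be illustrated with a
minimal Python sketch. Only the sizes come from the text (8-byte
header, 4-, 32- or 256-byte payload, 4-byte CRC). The position of
the source and destination addresses inside the header, the
function names, and the use of zlib.crc32 as the checksum are
assumptions made for illustration; the text does not specify them.

```python
import struct
import zlib

HEADER_LEN = 8  # from the text: 8-byte header
CRC_LEN = 4     # from the text: 4-byte CRC trailer

# Payload sizes stated in the text, keyed by a hypothetical type name.
PAYLOAD_LEN = {"ack": 4, "message": 32, "fmp": 256}


def build_packet(kind: str, source: int, destination: int, payload: bytes) -> bytes:
    """Frame a payload as header + payload + CRC."""
    if len(payload) != PAYLOAD_LEN[kind]:
        raise ValueError(f"{kind} payload must be {PAYLOAD_LEN[kind]} bytes")
    # Assumed header layout: 1-byte source, 1-byte destination, 6 pad bytes.
    header = struct.pack(">BB6x", source, destination)
    body = header + payload
    # zlib.crc32 is a stand-in; the text does not name the CRC polynomial.
    return body + struct.pack(">I", zlib.crc32(body))


def check_packet(packet: bytes) -> bool:
    """Recompute the CRC over header and payload and compare with the trailer."""
    body, trailer = packet[:-CRC_LEN], packet[-CRC_LEN:]
    return struct.unpack(">I", trailer)[0] == zlib.crc32(body)
```

Under these assumptions a message packet is 8 + 32 + 4 = 44 bytes
long, an acknowledgement packet 16 bytes, and a fabric management
packet 268 bytes.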

More particularly, and referring to FIG. 5, a pair of
output/input ports 325_1, 325_2 is provided for each one of the
crossbar switches 320, each one being coupled to a
corresponding one of the pair of message network boards 304_1,
304_2. Thus, each one of the message network boards 304_1,
304_2 has sixteen input/output ports 322_1-322_16, each one
being coupled to a corresponding one of the output/input
ports 325_1, 325_2, respectively, of a corresponding one of the
director boards 190_1-190_8, 210_1-210_8 through the
backplane 302, as shown. Thus, considering exemplary message
network board 304_1, FIG. 6, each switch 308_1-308_4 also
includes three coupling ports 324_1-324_3. The coupling ports
324_1-324_3 are used to interconnect the switches 308_1-308_4,
as shown in FIG. 6. Thus, considering message network
board 304_1, input/output ports 322_1-322_8 are coupled to
output/input ports 325_1 of front-end director boards
190_1-190_8 and input/output ports 322_9-322_16 are coupled to
output/input ports 325_1 of back-end director boards
210_1-210_8, as shown. Likewise, considering message
network board 304_2, input/output ports 322_1-322_8 thereof are
coupled, via the backplane 302, to output/input ports 325_2 of
front-end director boards 190_1-190_8 and input/output ports
322_9-322_16 are coupled, via the backplane 302, to output/
input ports 325_2 of back-end director boards 210_1-210_8.

As noted above, each one of the message network boards
304_1, 304_2 includes a processor 306 (FIG. 5) and a crossbar
switch section 308 having four switches 308_1-308_4, as
shown in FIGS. 5 and 6. The switches 308_1-308_4 are
interconnected as shown so that messages can pass between
any pair of the input/output ports 322_1-322_16. Thus, it
follows that a message from any one of the front-end
directors 180_1-180_32 can be coupled to another one of the
front-end directors 180_1-180_32 and/or to any one of the
back-end directors 200_1-200_32. Likewise, a message from
any one of the back-end directors 200_1-200_32 can be
coupled to another one of the back-end directors 200_1-200_32
and/or to any one of the front-end directors 180_1-180_32.

As noted above, each MAC packet (FIG. 2B) includes
an address destination portion and a data payload portion.
The MAC header is used to indicate the destination for the
MAC packet and such MAC header is decoded by the switch
to determine to which port the MAC packet is to be routed.
The decoding process uses a table in the switch 308_1-308_4,
such table being initialized by processor 306 during power-up.
The table provides the relationship between the MAC
header, which identifies the destination for the MAC packet,
and the route to be taken through the message network.
Thus, after initialization, the switches 320 and the switches
308_1-308_4 in switch section 308 provide packet routing
which enables each one of the directors 180_1-180_32,
200_1-200_32 to transmit a message between itself and any
other one of the directors, regardless of whether such other
director is on the same director board 190_1-190_8, 210_1-210_8
or on a different director board. Further, the MAC packet has
an additional bit B in the header thereof, as shown in FIG.
2B, which enables the message to pass through message
network board 304_1 or through message network board
304_2. During normal operation, this additional bit B is
toggled between a logic 1 and a logic 0 so that one message
passes through one of the redundant message network
boards 304_1, 304_2 and the next message passes through the
other one of the message network boards 304_1, 304_2 to
balance the load requirement on the system. However, in the
event of a failure in one of the message network boards 304_1,
304_2, the non-failed one of the boards 304_1, 304_2 is used
exclusively until the failed message network board is replaced.
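
The toggling of bit B between the two redundant message network
boards can be sketched as follows. This is an illustrative model
only: the class name, the method names, and the mapping of bit
value 0 to one board and 1 to the other are assumptions, since the
text does not say which logic level selects which board or how a
failure is reported.

```python
class BoardSelector:
    """Chooses the value of header bit B for each outgoing message."""

    def __init__(self) -> None:
        self._next_bit = 0
        # Assumed mapping: bit value 0 selects one board, 1 the other.
        self._healthy = {0: True, 1: True}

    def mark_failed(self, bit: int) -> None:
        self._healthy[bit] = False

    def mark_replaced(self, bit: int) -> None:
        self._healthy[bit] = True

    def next_bit(self) -> int:
        bit = self._next_bit
        if not self._healthy[bit]:
            bit ^= 1  # fall back to the non-failed board
        if not self._healthy[bit]:
            raise RuntimeError("no message network board available")
        self._next_bit = bit ^ 1  # alternate for the next message
        return bit
```

With both boards healthy the selector alternates 0, 1, 0, 1; after
one board is marked failed, every message carries the bit value of
the surviving board until the failed board is marked replaced.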

Referring now to FIG. 7, an exemplary one of the director
boards 190_1-190_8, 210_1-210_8, here director board 190_1, is
shown to include directors 180_1, 180_3, 180_5 and 180_7. An
exemplary one of the directors 180_1, 180_3, 180_5, 180_7,
here director 180_1, is
shown in detail to include the data pipe 316, the message
engine/CPU controller 314, the RAM 312, and the CPU 310
all coupled to the CPU interface bus 317, as shown. The
exemplary director 180_1 also includes: a local cache
memory 319 (which is coupled to the CPU 310); the
crossbar switch 318; and the crossbar switch 320, described
briefly above in connection with FIGS. 5 and 6. The data
pipe 316 includes a protocol translator 400, a quad port
RAM 402 and a quad port RAM controller 404 arranged as
shown. Briefly, the protocol translator 400 converts between
the protocol of the host computer 120, in the case of a
front-end director 180_1-180_32 (and between the protocol
used by the disk drives in bank 140 in the case of a back-end
director 200_1-200_32), and the protocol between the directors
180_1-180_32, 200_1-200_32 and the global memory 220 (FIG.
2). More particularly, the protocol used by the host computer
120 may, for example, be fibre channel, SCSI, ESCON or
FICON, as determined by the manufacturer of
the host computer 120, while the protocol used internal to the
system interface 160 (FIG. 2) may be selected by the
manufacturer of the interface 160. The quad port RAM 402
is a FIFO controlled by controller 404 because the rate of data
coming into the RAM 402 may be different from the rate of
data leaving the RAM 402. The RAM 402 has four ports,
each adapted to handle an 18 bit digital word. Here, the
protocol translator 400 produces 36 bit digital words for the
system interface 160 (FIG. 2) protocol; one 18 bit portion of
the word is coupled to one of a pair of the ports of the quad
port RAM 402 and the other 18 bit portion of the word is
coupled to the other one of the pair of the ports of the quad
port RAM 402. The quad port RAM has a pair of ports
402A, 402B, each one of the ports 402A, 402B being adapted
to handle an 18 bit digital word. Each one of the ports 402A,
402B is independently controllable and has independent, but
arbitrated, access to the memory array within the RAM 402.
Data is transferred between the ports 402A, 402B and the
cache memory 220 (FIG. 2) through the crossbar switch 318,
as shown.
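
The division of a 36 bit word into two 18 bit portions, one per
RAM port, amounts to the following arithmetic. The sketch is
hypothetical: the text does not state which 18 bit portion goes to
which port, so the high/low ordering and the function names are
assumptions.

```python
MASK_18 = (1 << 18) - 1  # 18 one-bits


def split_word(word36: int) -> tuple[int, int]:
    """Split a 36 bit word into (high 18 bits, low 18 bits)."""
    if not 0 <= word36 < (1 << 36):
        raise ValueError("word must fit in 36 bits")
    return (word36 >> 18) & MASK_18, word36 & MASK_18


def join_word(high18: int, low18: int) -> int:
    """Reassemble the 36 bit word from its two 18 bit portions."""
    return (high18 << 18) | low18
```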



The crossbar switch 318 includes a pair of switches 406A,
406B. Each one of the switches 406A, 406B includes four
input/output director-side ports D_1-D_4 (collectively referred
to above in connection with FIG. 5 as port 319) and four
input/output memory-side ports M_1-M_4, M_5-M_8,
respectively, as indicated. (The input/output memory-side
ports M_1-M_4, M_5-M_8 were collectively referred to above in
connection with FIG. 5 as port 317.) The director-side ports
D_1-D_4 of switch 406A are connected to the 402A ports of
the quad port RAMs 402 in each one of the directors 180_1,
180_3, 180_5 and 180_7, as indicated. Likewise, director-side
ports of switch 406B are connected to the 402B ports of the
quad port RAMs 402 in each one of the directors 180_1, 180_3,
180_5, and 180_7, as indicated. The ports D_1-D_4 are selectively
coupled to the ports M_1-M_4 in accordance with control
words provided to the switch 406A by the controllers in
directors 180_1, 180_3, 180_5, 180_7 on busses R_A1-R_A4,
respectively, and the ports D_1-D_4 are coupled to ports
M_5-M_8 in accordance with the control words provided to
switch 406B by the controllers in directors 180_1, 180_3, 180_5,
180_7 on busses R_B1-R_B4, as indicated. The signals on busses
R_A1-R_A4 are request signals. Thus, port 402A of any one of
the directors 180_1, 180_3, 180_5, 180_7 may be coupled to any
one of the ports M_1-M_4 of switch 406A, selectively in
accordance with the request signals on busses R_A1-R_A4.
Likewise, port 402B of any one of the directors 180_1, 180_3,
180_5, 180_7 may be coupled to any one of the ports M_5-M_8
of switch 406B, selectively in accordance with the request
signals on busses R_B1-R_B4. The coupling between the director
boards 190_1-190_8, 210_1-210_8 and the global cache memory
220 is shown in FIG. 8.

More particularly, and referring also to FIG. 2, as noted
above, each one of the host computer processors 121_1-121_32
in the host computer 120 is coupled to a pair of the front-end
directors 180_1-180_32 to provide redundancy in the event of
a failure in one of the front-end directors 180_1-180_32
coupled thereto. Likewise, the bank of disk drives 140 has
a plurality of, here 32, disk drives 141_1-141_32, each disk
drive 141_1-141_32 being coupled to a pair of the back-end
directors 200_1-200_32 to provide redundancy in the event of
a failure in one of the back-end directors 200_1-200_32
coupled thereto. Thus, considering exemplary host
computer processor 121_1, such processor 121_1 is coupled to a
pair of front-end directors 180_1, 180_2. Thus, if director 180_1
fails, the host computer processor 121_1 can still access the
system interface 160, albeit by the other front-end director
180_2. Thus, directors 180_1 and 180_2 are considered
redundancy pairs of directors. Likewise, other redundancy pairs
of front-end directors are: front-end directors 180_3, 180_4;
180_5, 180_6; 180_7, 180_8; 180_9, 180_10; 180_11, 180_12;
180_13, 180_14; 180_15, 180_16; 180_17, 180_18; 180_19, 180_20;
180_21, 180_22; 180_23, 180_24; 180_25, 180_26; 180_27, 180_28;
180_29, 180_30; and 180_31, 180_32 (only directors 180_31 and
180_32 being shown in FIG. 2).

Likewise, disk drive 141_1 is coupled to a pair of back-end
directors 200_1, 200_2. Thus, if director 200_1 fails, the disk
drive 141_1 can still access the system interface 160, albeit by
the other back-end director 200_2. Thus, directors 200_1 and
200_2 are considered redundancy pairs of directors. Likewise,
other redundancy pairs of back-end directors are: back-end
directors 200_3, 200_4; 200_5, 200_6; 200_7, 200_8; 200_9, 200_10;
200_11, 200_12; 200_13, 200_14; 200_15, 200_16; 200_17, 200_18;
200_19, 200_20; 200_21, 200_22; 200_23, 200_24; 200_25, 200_26;
200_27, 200_28; 200_29, 200_30; and 200_31, 200_32 (only directors
200_31 and 200_32 being shown in FIG. 2). Further, referring
also to FIG. 8, the global cache memory 220 includes a
plurality of, here eight, cache memory boards 220_1-220_8, as
shown. Still further, referring to FIG. 8A, an exemplary one
of the cache memory boards, here board 220_1, is shown in
detail and is described in detail in U.S. Pat. No. 5,943,287
entitled "Fault Tolerant Memory System", John K. Walton,
inventor, issued Aug. 24, 1999 and assigned to the same
assignee as the present invention, the entire subject matter
therein being incorporated herein by reference. Thus, as
shown in FIG. 8A, the board 220_1 includes a plurality of,
here four, RAM memory arrays, each one of the arrays having
a pair of redundant ports, i.e., an A port and a B port. The
board itself has sixteen ports: a set of eight A ports M_A1-M_A8
and a set of eight B ports M_B1-M_B8. Four of the eight A ports,
here A ports M_A1-M_A4, are coupled to the M_1 port of each of
the front-end director boards 190_1, 190_3, 190_5, and 190_7,
respectively, as indicated in FIG. 8. Four of the eight B ports,
here B ports M_B1-M_B4, are coupled to the M_1 port of each of
the front-end director boards 190_2, 190_4, 190_6, and 190_8,
respectively, as indicated in FIG. 8. The other four of the
eight A ports, here A ports M_A5-M_A8, are coupled to the M_1
port of each of the back-end director boards 210_1, 210_3,
210_5, and 210_7, respectively, as indicated in FIG. 8. The
other four of the eight B ports, here B ports M_B5-M_B8, are
coupled to the M_1 port of each of the back-end director
boards 210_2, 210_4, 210_6, and 210_8, respectively, as indicated
in FIG. 8.

Considering the exemplary four A ports M_A1-M_A4, each
one of the four A ports M_A1-M_A4 can be coupled to the A
port of any one of the memory arrays through the logic
network 221_1A. Thus, considering port M_A1, such port can be
coupled to the A port of the four memory arrays. Likewise,
considering the four A ports M_A5-M_A8, each one of the four
A ports M_A5-M_A8 can be coupled to the A port of any one
of the memory arrays through the logic network 221_1B.
Likewise, considering the four B ports M_B1-M_B4, each one
of the four B ports M_B1-M_B4 can be coupled to the B port
of any one of the memory arrays through logic network
221_1B. Likewise, considering the four B ports M_B5-M_B8,
each one of the four B ports M_B5-M_B8 can be coupled to the
B port of any one of the memory arrays through the logic
network 221. Thus, considering port M_B1, such port can be
coupled to the B port of the four memory arrays. Thus, there
are two paths by which data and control from either a front-end
director 180_1-180_32 or a back-end director 200_1-200_32 can
reach each one of the four memory arrays on the memory
board. Thus, there are eight sets of redundant ports on a
memory board, i.e., ports M_A1, M_B1; M_A2, M_B2; M_A3, M_B3;
M_A4, M_B4; M_A5, M_B5; M_A6, M_B6; M_A7, M_B7; and M_A8, M_B8.
Further, as noted above, each one of the directors has a pair
of redundant ports, i.e., a 402A port and a 402B port (FIG.
7). Thus, for each pair of redundant directors, the A port (i.e.,
port 402A) of one of the directors in the pair is connected to
one of the pair of redundant memory ports and the B port
(i.e., 402B) of the other one of the directors in such pair is
connected to the other one of the pair of redundant memory
ports.

More particularly, referring to FIG. 8B, an exemplary pair
of redundant directors is shown, here, for example, front-end
director 180_1 and front-end director 180_2. It is first noted that
the directors 180_1, 180_2 in each redundant pair of directors
must be on different director boards, here boards 190_1, 190_2,
respectively. Thus, here front-end director boards 190_1-190_8
have thereon: front-end directors 180_1, 180_3, 180_5 and 180_7;
front-end directors 180_2, 180_4, 180_6 and 180_8; front-end
directors 180_9, 180_11, 180_13 and 180_15; front-end directors
180_10, 180_12, 180_14 and 180_16; front-end directors 180_17,
180_19, 180_21 and 180_23; front-end directors 180_18, 180_20,
180_22 and 180_24; front-end directors 180_25, 180_27, 180_29 and
180_31; and front-end directors 180_26, 180_28, 180_30 and 180_32.
Likewise, back-end director boards 210_1-210_8 have
thereon: back-end directors 200_1, 200_3, 200_5 and 200_7;
back-end directors 200_2, 200_4, 200_6 and 200_8; back-end
directors 200_9, 200_11, 200_13 and 200_15; back-end directors
200_10, 200_12, 200_14 and 200_16; back-end directors 200_17,
200_19, 200_21 and 200_23; back-end directors 200_18, 200_20,
200_22 and 200_24; back-end directors 200_25, 200_27, 200_29 and
200_31; and back-end directors 200_26, 200_28, 200_30 and 200_32.
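
The board assignment listed above follows a regular pattern that
keeps the two directors of each redundancy pair on different
boards. The following sketch reproduces that pattern for one set
of eight boards and 32 directors. It assumes that the eighth board
carries directors 26, 28, 30 and 32, and that the redundancy
partner of an odd-numbered director is the next even-numbered one;
the function names are illustrative only.

```python
def directors_on_board(board: int) -> list[int]:
    """Director indices (1..32) carried by director board `board` (1..8)."""
    group = (board - 1) // 2   # boards come in odd/even pairs
    parity = (board - 1) % 2   # 0: odd-numbered directors, 1: even-numbered
    first = 8 * group + 1 + parity
    return [first + 2 * i for i in range(4)]


def board_of(director: int) -> int:
    """Board index (1..8) that carries the given director (1..32)."""
    group = (director - 1) // 8
    return 2 * group + 1 + (director - 1) % 2


def redundancy_partner(director: int) -> int:
    """Partner in the redundancy pair: (1, 2), (3, 4), ..., (31, 32)."""
    return director + 1 if director % 2 else director - 1
```

Because the partner always has the opposite parity within the same
group of eight, board_of never returns the same board for both
members of a pair.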

Thus, here front-end director 180_1, shown in FIG. 8A, is
on front-end director board 190_1 and its redundant front-end
director 180_2, shown in FIG. 8B, is on another front-end
director board, here for example, front-end director board
190_2. As described above, the port 402A of the quad port
RAM 402 (i.e., the A port referred to above) is connected to
switch 406A of crossbar switch 318 and the port 402B of the
quad port RAM 402 (i.e., the B port referred to above) is
connected to switch 406B of crossbar switch 318. The same
applies to redundant director 180_2. However, the ports M_1-M_4
of switch 406A of director 180_1 are connected to the M_A1 ports
of global cache memory boards 220_1-220_4, as shown, while
for its redundancy director 180_2, the ports M_1-M_4 of switch
406A are connected to the redundant M_B1 ports of global
cache memory boards 220_1-220_4, as shown.

Referring in more detail to the crossbar switch 318 (FIG.
7), as noted above, each one of the director boards
190_1-210_8 has such a switch 318 and such switch 318
includes a pair of switches 406A, 406B. Each one of the
switches 406A, 406B is identical in construction, an
exemplary one thereof, here switch 406A, being shown in detail
in FIG. 8C. Thus switch 406A includes four input/output
director-side ports D_1-D_4 as described in connection with
exemplary director board 190_1. Thus, for the director board
190_1 shown in FIG. 7, the four input/output director-side
ports D_1-D_4 of switch 406A are each coupled to the port
402A of a corresponding one of the directors 180_1, 180_3,
180_5, and 180_7 on the director board 190_1.

Referring again to FIG. 8C, the exemplary switch 406A
includes a plurality of, here four, switch sections 430_1-430_4.
Each one of the switch sections 430_1-430_4 is identical in
construction and is coupled between a corresponding one of
the input/output director-side ports D_1-D_4 and a
corresponding one of the output/input memory-side ports M_1-M_4,
respectively, as shown. (It should be understood that the
output/input memory-side ports of switch 406B (FIG. 7) are
designated as ports M_5-M_8, as shown. It should also be
understood that while switch 406A is responsive to request
signals on busses R_A1-R_A4 from quad port controller 404 in
directors 180_1, 180_3, 180_5, 180_7 (FIG. 7), switch 406B is
responsive in like manner to request signals on busses
R_B1-R_B4 from controller 404 in directors 180_1, 180_3, 180_5
and 180_7.) More particularly, controller 404 of director 180_1
produces request signals on busses R_A1 or R_B1. In like
manner, controller 404 of director 180_3 produces request
signals on busses R_A2 or R_B2, controller 404 of director 180_5
produces request signals on busses R_A3 or R_B3, and
controller 404 of director 180_7 produces request signals on
busses R_A4 or R_B4.

Considering exemplary switch section 430_1, such switch
section 430_1 is shown in FIG. 8C to include a FIFO 432 fed
by the request signal on bus R_A1. (It should be understood
that the FIFOs, not shown, in switch sections 430_2-430_4 are
fed by request signals R_A2-R_A4, respectively.) The switch
section 430_1 also includes a request generator 434, an
arbiter 436, and selectors 442 and 446, all arranged as
shown. The data at the memory-side ports M_1-M_4 on
busses DM1-DM4 are fed as inputs to selector 446. Also fed
to selector 446 is a control signal produced by the request
generator on bus 449 in response to the request signal R_A1
stored in FIFO 432. The control signal on bus 449 indicates
to the selector 446 the one of the memory-side ports M_1-M_4
which is to be coupled to director-side port D_1. The other
switch sections 430_2-430_4 operate in like manner with
regard to director-side ports D_2-D_4, respectively, and the
memory-side ports M_1-M_4.

It is to be noted that the data portion of the word at port
D_1 (i.e., the word on bus DD1) is also coupled to the other
switch sections 430_2-430_4. It is further noted that the data
portion of the words at ports D_2-D_4 (i.e., the words on
busses DD2-DD4, respectively) are fed to the switch
sections 430_1-430_4, as indicated. That is, each one of the
switch sections 430_1-430_4 has the data portion of the words on
ports D_1-D_4 (i.e., busses DD1-DD4), as indicated. It is also
noted that the data portion of the word at port M_1 (i.e., the
word on bus DM1) is also coupled to the other switch
sections 430_2-430_4. It is further noted that the data portion
of the words at ports M_2-M_4 (i.e., the words on busses
DM2-DM4, respectively) are fed to the switch sections
430_2-430_4, as indicated. That is, each one of the switch
sections 430_1-430_4 has the data portion of the words on
ports M_1-M_4 (i.e., busses DM1-DM4), as indicated.

As will be described in more detail below, a request on
bus R_A1 to switch section 430_1 is a request from the director
180_1 which identifies the one of the four ports M_1-M_4 in
switch section 430_1 that is to be coupled to port 402A of
director 180_1 (director-side port D_1). Thus, port 402A of
director 180_1 may be coupled to one of the memory-side ports
M_1-M_4 selectively in accordance with the data on bus R_A1.
Likewise, requests on busses R_A2, R_A3, R_A4 to switch
sections 430_2-430_4, respectively, are requests from the
directors 180_3, 180_5, and 180_7, respectively, which identify
the one of the four ports M_1-M_4 in switch sections
430_2-430_4 that is to be coupled to port 402A of directors
180_3, 180_5 and 180_7, respectively.

More particularly, the requests R_A1 are stored as they are
produced by the quad port RAM controller 404 (FIG. 7) in
receive FIFO 432. The request generator 434 receives from
FIFO 432 the requests and determines which one of the four
memory-side ports M_1-M_4 is to be coupled to port 402A of
director 180_1. These requests for memory-side ports M_1-M_4
are produced on lines RA1,1-RA1,4, respectively. Thus,
line RA1,1 (i.e., the request for memory-side port M_1) is fed
to arbiter 436 and the requests from switch sections
430_2-430_4 (which are coupled to port 402A of directors
180_3, 180_5, and 180_7) on lines RA2,1, RA3,1 and RA4,1,
respectively, are also fed to the arbiter 436, as indicated. The
arbiter 436 resolves multiple requests for memory-side port
M_1 on a first come-first serve basis. The arbiter 436 then
produces a control signal on bus 435 indicating the one of
the directors 180_1, 180_3, 180_5 or 180_7 which is to be coupled
to memory-side port M_1.

The control signal on bus 435 is fed to selector 442. Also
fed to selector 442 are the data portion of the data at port D_1
(i.e., the data on data bus DD1) along with the data portion of
the data at ports D_2-D_4 (i.e., the data on data busses
DD2-DD4, respectively), as indicated. Thus, the control
signal on bus 435 causes the selector 442 to couple to the
output thereof the one of the data busses DD1-DD4 from the
one of the directors 180_1, 180_3, 180_5, 180_7 being granted
access to memory-side port M_1 by the arbiter 436. The selected
output of selector 442 is coupled to memory-side port M_1.
It should be noted that when the arbiter 436 receives a
request via the signals on lines RA1,1, RA2,1, RA3,1 and
RA4,1, acknowledgements are returned by the arbiter 436
via acknowledgement signals on lines AK1,1, AK1,2, AK1,3,
AK1,4, respectively, such signals being fed to the request
generators 434 in switch sections 430_1, 430_2, 430_3, 430_4,
respectively.
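
The first come-first serve arbitration for one memory-side port,
followed by the selection of the granted director's data bus, can
be modeled in a few lines. This is a behavioral sketch only, not a
description of the hardware: the class and function names are
hypothetical, switch sections are identified by the integers 1-4,
and the returned grant stands in for both the control signal and
the acknowledgement described above.

```python
from collections import deque
from typing import Optional


class PortArbiter:
    """First come-first serve arbiter for a single memory-side port."""

    def __init__(self) -> None:
        self._pending: deque[int] = deque()

    def request(self, section: int) -> None:
        """Record a request from switch section 1..4 in arrival order."""
        self._pending.append(section)

    def grant(self) -> Optional[int]:
        """Return the oldest requesting section, or None if idle."""
        return self._pending.popleft() if self._pending else None


def select(data_busses: dict[int, bytes], granted: int) -> bytes:
    """Selector: pass through the data bus of the granted section."""
    return data_busses[granted]
```

If sections 3 and 1 request the port in that order, the arbiter
grants section 3 first and section 1 next, and the selector
forwards the corresponding data bus each time.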

Thus, the data on any port D_1-D_4 can be coupled to any
one of the ports M_1-M_4 to effectuate the point-to-point data
paths P_1-P_64 described above in connection with FIG. 2.

Referring again to FIG. 7, data from host computer 120
(FIG. 2) is presented to the system interface 160 (FIG. 2) in
batches from many host computer processors 121_1-121_32.
Thus, the data from the host computer processors
121_1-121_32 are interleaved with each other as they are
presented to a director 180_1-180_32. The batch from each
host computer processor 121_1-121_32 (i.e., source) is tagged
by the protocol translator 400, more particularly by a
Tachyon ASIC in the case of a fibre channel connection. The
controller 404 has a look-up table formed during
initialization. As the data comes into the protocol translator 400
and is put into the quad port RAM 402 under the control of
controller 404, the protocol translator 400 informs the
controller that the data is in the quad port RAM 402. The
controller 404 looks at the configuration of its look-up table
to determine the global cache memory 220 location (e.g.,
cache memory board 220_1-220_8) the data is to be stored
into. The controller 404 thus produces the request signals on
the appropriate bus R_A1, R_B1, and then tells the quad port
RAM 402 that there is a block of data at a particular location
in the quad port RAM 402 and to move it to the particular
location in the global cache memory 220. The crossbar switch
318 also takes a look at what other controllers 404 in the
directors 180_3, 180_5, and 180_7 on that particular director
board 190_1 are asking by making request signals on busses
R_A2, R_B2, R_A3, R_B3, R_A4, R_B4, respectively. The arbitration
of multiple requests is handled by the arbiter 436 as
described above in connection with FIG. 8C.

Referring again to FIG. 7, the exemplary director 180_1 is
shown to include the message engine/CPU controller 314.
The message engine/CPU controller 314 is contained in a
field programmable gate array (FPGA). The message engine
(ME) 315 is coupled to the CPU bus 317 and the DMA
section 408 as shown. The message engine (ME) 315
includes a Direct Memory Access (DMA) section 408, a
message engine (ME) state machine 410, a transmit buffer
422 and receive buffer 424, a MAC packetizer/de-packetizer
428, send and receive pointer registers 420, and a parity
generator 321. The DMA section 408 includes a DMA
transmitter 418, shown and to be described below in detail
in connection with FIG. 9, and a DMA receiver 424, shown
and to be described below in detail in connection with FIG.
10, each of which is coupled to the CPU bus interface 317,
as shown in FIG. 7. The message engine (ME) 315 includes
a transmit data buffer 422 coupled to the DMA transmitter
418, a receive data buffer 424 coupled to the DMA receiver
421, registers 420 coupled to the CPU bus 317 through an
address decoder 401, the packetizer/de-packetizer 428,
described above, coupled to the transmit data buffer 422, the
receive data buffer 424 and the crossbar switch 320, as
shown, and a parity generator 321 coupled between the
transmit data buffer 422 and the crossbar switch 320. More
particularly, the packetizer portion 428P is used to packetize
the message payload into a MAC packet (FIG. 2B) passing
from the transmit data buffer 422 to the crossbar switch 320
and the de-packetizer portion 428D is used to de-packetize
the MAC packet into message payload data passing from the
crossbar switch 320 to the receive data buffer 424. The
packetization is here performed by a MAC core which
builds a MAC packet and appends to each message such
things as a source and destination address designation
indicating the director sending and receiving the message and a
cyclic redundancy check (CRC), as described above. The
message engine (ME) 315 also includes: a receive write
pointer 450, a receive read pointer 452, a send write pointer
454, and a send read pointer 456.

Referring now to FIGS. 11 and 12, the transmission of a
message from a director 180₁-180₃₂, 200₁-200₃₂ and the
reception of a message by a director 210₁-210₃₂ (here
exemplary director 180₁ shown in FIG. 7) will be described.
Considering first transmission of a message, reference is
made to FIGS. 7 and 11. First, as noted above, at power-up
the controller 306 (FIG. 5) of each of the message network
boards 304₁, 304₂ initializes the message routing mapping
described above for the switches 308₁-308₄ in switch section
308 and for the crossbar switches 320. As noted above, a
request is made by the host computer 120. The request is sent
to the protocol translator 400. The protocol translator 400
sends the request to the microprocessor 299 via CPU bus 317
and buffer 301. When the CPU 310 (FIG. 7) in the
microprocessor 299 of exemplary director 180₁ determines that
a message is to be sent to another one of the directors
180₂-180₃₂, 200₁-200₃₂ (e.g., the CPU 310 determines that
there has been a "miss" in the global cache memory 220
(FIG. 2) and wants to send a message to the appropriate one
of the back-end directors 200₁-200₃₂, as described above in
connection with FIG. 2), the CPU 310 builds a 64-byte
descriptor (FIG. 2A) which includes a 32-byte message
payload indicating the addresses of the batch of data to be
read from the bank of disk drives 140 (FIG. 2) (Step 500)
and a 32-byte command field (Step 510) which indicates the
message destination via an 8-byte bit vector, i.e., the
director, or directors, which are to receive the message.
Each one of the 64 bits in the 8-byte portion of the command
field corresponds to one of the 64 directors. Here, a logic 1
in a bit indicates that the corresponding director is to
receive a message and a logic 0 indicates that such
corresponding director is not to receive the message. Thus,
if the 8-byte word has more than one logic 1 bit, more than
one director will receive the same message. As will be
described, the same message will not be sent in parallel to
all such directors but rather the same message will be sent
sequentially to all such directors. In any event, the
resulting 64-byte descriptor generated by the CPU 310
(FIG. 7) (Step 512) is written into the RAM 312 (Step 514),
as shown in FIG. 11.
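
The descriptor layout just described can be illustrated with a short sketch. It is a minimal model only: the 32-byte payload, the 32-byte command field, and the 8-byte destination vector come from the text, while the function names, the position of the vector at the start of the command field, and the little-endian byte order are assumptions made for illustration.

```python
PAYLOAD_BYTES = 32   # message payload size given in the text
COMMAND_BYTES = 32   # command field size given in the text
VECTOR_BYTES = 8     # destination bit vector inside the command field
NUM_DIRECTORS = 64


def destination_vector(bit_positions):
    """Return a 64-bit integer with a logic 1 at each 1-based bit position."""
    vector = 0
    for position in bit_positions:
        if not 1 <= position <= NUM_DIRECTORS:
            raise ValueError(f"bit position {position} is outside 1-{NUM_DIRECTORS}")
        vector |= 1 << (position - 1)
    return vector


def build_descriptor(payload, bit_positions):
    """Pack a payload and a destination vector into a 64-byte descriptor."""
    if len(payload) > PAYLOAD_BYTES:
        raise ValueError("payload exceeds 32 bytes")
    payload_field = payload.ljust(PAYLOAD_BYTES, b"\x00")
    # Assumption: the vector leads the command field, stored little-endian.
    vector_bytes = destination_vector(bit_positions).to_bytes(VECTOR_BYTES, "little")
    command_field = vector_bytes.ljust(COMMAND_BYTES, b"\x00")
    return payload_field + command_field


# Hypothetical multi-cast to bit positions 2, 3, 63 and 64.
descriptor = build_descriptor(b"disk addresses", [2, 3, 63, 64])
```

A vector with a single logic 1 models a uni-cast message; more than one logic 1 models a multi-cast message.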

More particularly, the RAM 312 includes a pair of queues:
a send queue and a receive queue, as shown in FIG. 7. The
RAM 312 is coupled to the CPU bus 317 through an Error
Detection and Correction (EDAC)/Memory control section
303, as shown. The CPU 310 then indicates to the message
engine (ME) 315 state machine 410 (FIG. 7) that a descriptor
has been written into the RAM 312. It should be noted
that the message engine (ME) 315 also includes: a receive
write pointer or counter 450, the receive read pointer or
counter 452, the send write pointer or counter 454, and the
send read pointer or counter 456, shown in FIG. 7. All four
pointers 450, 452, 454 and 456 are reset to zero on power-up.
As is also noted above, the message engine/CPU controller
314 also includes: the de-packetizer portion 428D of
packetizer/de-packetizer 428, coupled to the receive data
buffer 424 (FIG. 7) and a packetizer portion 428P of the
packetizer/de-packetizer 428, coupled to the transmit data
buffer 422 (FIG. 7). Thus, referring again to FIG. 11, when
the CPU 310 indicates that a descriptor has been written into
the RAM 312 and is now ready to be sent, the CPU 310
increments the send write pointer and sends it to the send
write pointer register 454 via the register decoder 401. Thus,
the contents of the send write pointer register 454 indicate
the number of messages in the send queue 312S of RAM
312, which have not been sent. The state machine 410
checks the send write pointer register 454 and the send read
pointer register 456, Step 518. As noted above, both the send
write pointer register 454 and the send read pointer register
456 are initially reset to zero during power-up. Thus, if the
send read pointer register 456 and the send write pointer
register 454 are different, the state machine knows that there
is a message in RAM 312 and that such message is ready
for transmission. If a message is to be sent, the state machine
410 initiates a transfer of the stored 64-byte descriptor to the
message engine (ME) 315 via the DMA transmitter 418,
FIG. 7 (Steps 520, 522). The descriptor is sent from the send
queue 312S in RAM 312 until the send read pointer 456 is
equal to the send write pointer 454.
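
The pointer comparison above can be sketched as a small circular queue. This is an illustrative model, not the hardware: the class and method names are hypothetical, the queue depth is arbitrary, and the read pointer is advanced as each descriptor is taken, whereas the text advances the send read pointer when the acknowledgement arrives.

```python
class SendQueue:
    """Toy model of the send queue with its write and read pointers."""

    def __init__(self, slots=8):
        self.slots = [None] * slots
        self.write_ptr = 0   # models send write pointer 454, reset to zero
        self.read_ptr = 0    # models send read pointer 456, reset to zero

    def cpu_enqueue(self, descriptor):
        """CPU stores a descriptor, then increments the send write pointer."""
        if self.write_ptr - self.read_ptr == len(self.slots):
            raise OverflowError("send queue is full")
        self.slots[self.write_ptr % len(self.slots)] = descriptor
        self.write_ptr += 1

    def has_pending(self):
        """Differing pointers mean a message is ready for transmission."""
        return self.read_ptr != self.write_ptr

    def drain(self):
        """Take descriptors until the read pointer equals the write pointer."""
        sent = []
        while self.has_pending():
            sent.append(self.slots[self.read_ptr % len(self.slots)])
            self.read_ptr += 1
        return sent


queue = SendQueue(slots=4)
queue.cpu_enqueue(b"descriptor A")
queue.cpu_enqueue(b"descriptor B")
pending_before = queue.has_pending()
sent = queue.drain()
```

Because both pointers start at zero and only ever advance, equality always means the queue holds nothing left to send.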

As described above in connection with Step 510, the CPU
310 generates a destination vector indicating the director, or
directors, which are to receive the message. As also indicated
above, the command field is 32 bytes, eight bytes
thereof having a bit representing a corresponding one of the
64 directors to receive the message. For example, referring
to FIG. 11C, each of the bit positions 1-64 represents
directors 180₁-180₃₂, 200₁-200₃₂, respectively. Here, in this
example, because a logic 1 is only in bit position 1, the
eight-byte vector indicates that the destination director is
only front-end director 180₁. In the example in FIG. 11D,
because a logic 1 is only in bit position 2, the eight-byte
vector indicates that the destination director is only
front-end director 180₂. In the example in FIG. 11E, because a
logic 1 is in more than one bit position, the destination for
the message is more than one director, i.e., a multi-cast
message. In the example in FIG. 11E, a logic 1 is only in bit
positions 2, 3, 63 and 64. Thus, the eight-byte vector
indicates that the destination directors are only front-end
directors 180₂ and 180₃ and back-end directors 200₃₁ and
200₃₂. There is a mask vector stored in a register of register
section 420 (FIG. 7) in the message engine (ME) 315 which
identifies a director or directors which may not be available
for use (e.g., a defective director or a director not in the
system at that time) (Steps 524, 525, for a uni-cast
transmission). If the message engine (ME) 315 state machine
410 indicates that the director is available by examining the
transmit vector mask (FIG. 11F) stored in register 420, the
message engine (ME) 315 encapsulates the message payload
with a MAC header and CRC inside the packetizer portion 428P,
discussed above (Step 526). An example of the mask is
shown in FIG. 11F. The mask has 64 bit positions, one for
each one of the directors. Thus, as with the destination
vectors described above in connection with FIGS. 11C-11E,
bit positions 1-64 represent directors 180₁-180₃₂,
200₁-200₃₂, respectively. Here, in this example, a logic 1 in
a bit position in the mask indicates that the representative
director is available and a logic 0 in such bit position
indicates that the representative director is not available.
Here, in the example shown in FIG. 11F, only director 200₃₂
is unavailable. Thus, if the message has a destination vector
as indicated in FIG. 11E, the destination vector, after passing
through the mask of FIG. 11F, is modified to that shown in
FIG. 11G. Thus, director 200₃₂ will not
receive the message. Such mask modification to the
destination vector is important because, as will be described,
the messages on a multi-cast are sent sequentially and not in
parallel. Thus, elimination of message transmission to an
unavailable director or directors increases the message
transmission efficiency of the system.
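
The masking step can be worked through with the numbers given above. The sketch below is illustrative: it assumes the mask is applied as a bitwise AND, which matches the described result but is not stated as the circuit's implementation, and all function names are hypothetical. Directors are written as 180_n and 200_n in the code.

```python
NUM_DIRECTORS = 64


def vector_from_positions(positions):
    """Build a 64-bit vector with a logic 1 at each 1-based bit position."""
    vector = 0
    for position in positions:
        vector |= 1 << (position - 1)
    return vector


def positions_from_vector(vector):
    """List the 1-based bit positions that hold a logic 1."""
    return [p for p in range(1, NUM_DIRECTORS + 1) if (vector >> (p - 1)) & 1]


def director_name(position):
    """Positions 1-32 are front-end directors, 33-64 back-end directors."""
    if position <= 32:
        return f"180_{position}"
    return f"200_{position - 32}"


def masked_recipients(destination, mask):
    """Drop unavailable directors, then list recipients in sending order."""
    masked = destination & mask   # assumption: mask applied as bitwise AND
    return [director_name(p) for p in positions_from_vector(masked)]


# Destination vector of FIG. 11E: bit positions 2, 3, 63 and 64.
destination = vector_from_positions([2, 3, 63, 64])
# Mask of FIG. 11F: every director available except 200_32 (bit 64).
mask = vector_from_positions(range(1, 65)) & ~vector_from_positions([64])
recipients = masked_recipients(destination, mask)
```

The recipients list contains three directors instead of four, so one sequential transmission is saved for the multi-cast.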

Having packetized the message into a MAC packet via the
packetizer portion of the packetizer/de-packetizer 428 (FIG.
7), the message engine (ME) 315 transfers the MAC packet
to the crossbar switch 320 (Step 528) and the MAC packet
is routed to the destination by the message network 260
(Step 530) via message network boards 304₁, 304₂ or on the
same director board via the crossbar switch 320 on such
director board.

Referring to FIG. 12, the message read operation is
described. Thus, in Step 600 the director waits for a
message. When a message is received, the message engine (ME)
315 state machine 410 receives the packet (Step 602). The
state machine 410 checks the receive bit vector mask (FIG.
11, stored in register 426) against the source address of the
packet (Step 604). If the state machine 410 determines that
the message is from an improper source (i.e., a faulty
director as indicated in the mask, FIG. 11F, for example), the
packet is discarded (Step 606). On the other hand, if the state
machine 410 determines that the packet is from a proper or
valid director (i.e., source), the message engine (ME) 315
de-encapsulates the message from the packet (Step 608) in
de-packetizer 428D. The state machine 410 in the message
engine (ME) 315 initiates a 32-byte payload transfer via the
DMA receive operation (Step 610). The DMA writes the
32-byte message to the memory receive queue 312R in the
RAM 312 (Step 612). The message engine (ME) 315 state
machine 410 then increments the receive write pointer
register 450 (Step 614). The CPU 310 then checks whether
the receive write pointer 450 is equal to the receive read
pointer 452 (Step 616). If they are equal, such condition
indicates to the CPU 310 that a message has not been
received (Step 618). On the other hand, if the receive write
pointer 450 and the receive read pointer 452 are not equal,
such condition indicates to the CPU 310 that a message has
been received and the CPU 310 processes the message in the
receive queue 312R of RAM 312 and then the CPU 310
increments the receive read pointer and writes it into the
receive read pointer register 452. Thus, messages are stored
in the receive queue 312R of RAM 312 until the contents of
the receive read pointer 452 and the contents of the receive
write pointer 450, which are initialized to zero during
power-up, are equal.

Referring now to FIG. 13, the acknowledgement of a
message operation is described. In Step 700 the receive
DMA engine 420 successfully completes a message transfer
to the receive queue in RAM 312 (FIG. 7). The state
machine 410 in the message engine (ME) 315 generates an
acknowledgement MAC packet and transmits the MAC
packet to the sending director via the message network 260
(FIG. 2) (Steps 702, 704). The message engine (ME) 315 at
the sending director de-encapsulates a 16-byte status payload
in the acknowledgement MAC packet and transfers such
status payload via a receive DMA operation (Step 706). The
DMA of the sending (i.e., source) director writes to a status
field of the descriptor within the RAM memory send queue
312S (Step 708). The state machine 410 of the message
engine (ME) 315 of the sending director (which received the
acknowledgement message) increments its send read pointer
456 (Step 712). The CPU 310 of the sending director (which
received the acknowledgement message) processes the
descriptor status and removes the descriptor from the send
queue 312S of RAM 312 (Step 714). It should be noted that
the send and receive queues 312S and 312R are each circular
queues.

As noted above, the MAC packets are, during normal
operation, transmitted alternately to one of the pair of
message network boards 304₁, 304₂ by hardware, a selector
S in the crossbar switch 320. The selector S is responsive to
the bit B in the header of the MAC packet (FIG. 2B) and,
when such bit B is one logic state the data is coupled to one
of the message network boards 402A and in response to the
opposite logic state the data is coupled to the other one of the
message network boards 402B. That is, when one message
is transmitted to board 304₁ the next message is transmitted
to board 304₂.

Referring again to FIG. 9, the details of an exemplary
transmit DMA 418 are shown. As noted above, a
descriptor is created by the CPU 310 (FIG. 7) and is
then stored in the RAM 312. If the send write pointer 454
(FIG. 7) and send read pointer 456, described above, have
different counts, an indication is provided by the state
machine 410 in the message engine (ME) 315 (FIG. 7) that
the created descriptor is available for DMA transmission to
the message engine (ME) 315; the payload of the descriptor
is packetized into a MAC packet and sent through the
message network 260 (FIG. 2) to one or more directors
180₁-180₃₂, 200₁-200₃₂. More particularly, the descriptor
created by the CPU 310 is first stored in the local cache
memory 319 and is later transferred to the send queue 312S
in RAM 312. When the send write pointer 454 and send read
pointer 456 have different counts, the message engine (ME)
315 state machine 410 initiates a DMA transmission as
discussed above in connection with Step 520 (FIG. 11).
Further, as noted above, the descriptor resides in send
queue 312S within the RAM 312. Further, as noted above,
each descriptor which contains the message is a fixed size,
here 64 bytes. As each new, non-transmitted descriptor is
created by the CPU 310, it is sequentially stored in a
sequential location, or address, in the send queue 312S. Here,
the address is a 32-bit address.

When the transmit DMA is initiated, the state machine
410 in the message engine (ME) 315 (FIG. 7) sends the
queue address on bus 411 to an address register 413 in the
DMA transmitter 418 (FIG. 9) along with a transmit write
enable signal Tx_WE. The DMA transmitter 418
requests the CPU bus 317 by asserting a signal on Xmit_Br.
The CPU bus arbiter 414 (FIG. 7) performs a bus arbitration
and when appropriate the arbiter 414 grants the DMA
transmitter 418 access to the CPU bus 317. The Xmit CPU
state machine 419 then places the address currently available
in the address register 413 on the Address bus portion 317A
of CPU bus 317 by loading the output address register 403.
Odd parity is generated by a Parity generator 405 before
loading the output address register 403. The address in
register 403 is placed on the CPU bus 317 (FIG. 7) for RAM
312 send queue 312S, along with appropriate read control
signals via CPU bus 317 portion 317C. The data at the
address from the RAM 312 passes, via the data bus portion
317D of CPU bus 317, through a parity checker 415 to a data
input register 417. The control signals from the CPU 310 are
fed to a Xmit CPU state machine 419 via CPU bus 317 bus
portion 317C. One of the control signals indicates whether
the most recent copy of the requested descriptor is in the
send queue 312S of the RAM 312 or still resident in the local
cache memory 319. That is, the most recent descriptor at any
given address is first formed by the CPU 310 in the local
cache memory 319 and is later transferred by the CPU 310
to the queue in the RAM 312. Thus, there may be two
descriptors with the same address: one in the RAM 312 and
one in the local cache memory 319 (FIG. 7), the most recent
one being in the local cache memory 319. In either event, the
transmit DMA 418 must obtain the descriptor for DMA
transmission from the RAM 312 and this descriptor is stored
in the transmit buffer register 421 using signal 402 produced
by the state machine 419 to load these registers 421. The
control signal from the CPU 310 to the Xmit CPU state
machine 419 indicates whether the most recent descriptor is
in the local cache memory 319. If the most recent descriptor
is in the local cache memory 319, the Xmit CPU state
machine 419 inhibits the data that was just read from send
queue 312S in the RAM 312 and which has been stored in
register 421 from passing to selector 423. In such case, state
machine 419 must perform another data transfer at the same
address location. The most recent message is then transferred
by the CPU 310 from the local cache memory 319 to
the send queue 312S in the RAM 312. The transmit message
state machine 419 then re-arbitrates for the CPU bus 317 and
after it is granted such CPU bus 317, the Xmit CPU state
machine 419 then reads the descriptor from the RAM 312.
This time, however, the most recent descriptor is
available in the send queue 312S in the RAM 312. The
descriptor in the RAM 312 is now loaded into the transmit
buffer register 421 in response to the assertion of the signal
402 by the Xmit CPU state machine 419. The descriptor in
the register 421 is then transferred through selector 423 to
message bus interface 409 under the control of a Xmit
message (msg) state machine 427. That is, the descriptor in
the transmit buffer register 421 is transferred to the transmit
data buffer 422 (FIG. 7) over the 32-bit transmit message bus
interface 409 by the Xmit message (msg) state machine 427.
The data in the transmit data buffer 422 (FIG. 7) is
packetized by the packetizer section of the
packetizer/de-packetizer 428 as described in Step 530 in FIG. 11.

More particularly, and referring also to FIG. 14A, the
method of operating the transmit DMA 418 (FIG. 9) is
shown. As noted above, each descriptor is 64 bytes. Here, the
transfer of the descriptor takes place over two interfaces,
namely, the CPU bus 317 and the transmit message interface
bus 409 (FIG. 7). The CPU bus 317 is 64 bits wide and eight
64-bit double-words constitute a 64-byte descriptor. The
Xmit CPU state machine 419 generates the control signals
which result in the transfer of the descriptor from the RAM
312 into the transmit buffer register 421 (FIG. 7). The
64-byte descriptor is transferred in two 32-byte burst
accesses on the CPU bus 317. Each one of the eight
double-words is stored sequentially in the transmit buffer
register 421 (FIG. 9). Thus, in Step 800, the message engine
315 state machine 410 loads the transmit DMA address register
413 with the address of the descriptor to be transmitted in the
send queue 312S in RAM 312. This is done by asserting
the Tx_WE signal, which puts the Xmit CPU state machine
419 in Step 800; the state machine loads the address register
413 and proceeds to Step 802. In Step 802, the Xmit CPU
state machine 419 loads the CPU transfer counter 431 (FIG. 9)
with a 32-byte count, which is 2. This is the number of
32-byte transfers required to transfer the 64-byte
descriptor. The Xmit CPU state machine 419 now proceeds to
Step 804. In Step 804, the transmit DMA state machine 419
checks the validity of the address that is loaded into its
address register 413. The address loaded into the address
register 413 is checked against the values loaded into the
memory address registers 435. The memory address registers
435 contain the base address and the offset of the send
queue 312S in the RAM 312. The sum of the base address
and the offset is the range of addresses for the send queue
312S in RAM 312. The address check circuitry 437
constantly checks whether the address in the address register
413 is within the range of the send queue 312S in the RAM
312. If the address is found to be outside the range of the
send queue 312S, the transfer is aborted; this status is stored
in the status register 404 and then passed back to the
message engine 315 state machine 410 in Step 416. The
check for valid addresses is done in Step 805. If the address
is within the range, i.e., valid, the transmit DMA state
machine 419 proceeds with the transfer and proceeds to Step
806. In Step 806, the transmit DMA state machine 419
requests the CPU bus 317 by asserting the Xmit_BR signal
to the arbiter 414 and then proceeds to Step 807. In Step 807,
the Xmit CPU state machine 419 constantly checks if it has
been granted the bus by the arbiter. When the CPU bus 317
is granted, the Xmit CPU state machine proceeds to Step
808. In Step 808, the Xmit CPU state machine 419 generates
an address and a data cycle which essentially reads 32 bytes
of the descriptor from the send queue 312S in the RAM 312
into its transmit buffer register 421. The Xmit CPU state
machine 419 now proceeds to Step 810. In Step 810, the
Xmit CPU state machine 419 loads the descriptor that was
read into its buffer registers 421 and proceeds to Step 811.
In Step 811, a check is made for any local cache memory 319
coherency errors (i.e., checks whether the most recent data
is in the cache memory 319 and not in the RAM 312) on
these 32 bytes of data. If this data is detected to be resident
in the local CPU cache memory 319, then the Xmit CPU state
machine 419 discards this data and proceeds to Step 806.
The Xmit CPU state machine 419 now requests the CPU
bus 317 again and, when granted, transfers another 32 bytes
of data into the transmit buffer register 421, by which time
the CPU has already transferred the latest copy of the
descriptor into the RAM 312. In cases when the 32 bytes of
the descriptor initially fetched from the RAM 312 were not
resident in the local CPU cache memory 319 (i.e., if no
cache coherency errors were detected), the Xmit CPU state
machine 419 proceeds to Step 812. In Step 812, the Xmit
CPU state machine 419 decrements counter 431 and
increments the address register 413 so that such address
register 413 points to the next address. The Xmit CPU state
machine then proceeds to Step 814. In Step 814, the Transmit
CPU state machine 419 checks to see if the transfer counter
431 has expired, i.e., counted to zero. If the count is found
to be non-zero, it then proceeds to Step 804 to start the
transfer of the next 32 bytes of the descriptor. In case the
counter 431 is zero, the process goes to Step 816 to complete
the transfer. The successful transfer of the second 32 bytes
of descriptor from the RAM 312 into the transmit DMA
buffer register 421 completes the transfer over the CPU bus
317.
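
The CPU-bus side of the transmit DMA (Steps 802 through 816) can be summarized in a short sketch. It is a simplified software model under stated assumptions: the RAM is a byte string, the queue range is the base address plus the offset, and a cache coherency error is represented by a hypothetical set of addresses whose first read must be discarded. Function and parameter names are illustrative only.

```python
DESCRIPTOR_BYTES = 64
BURST_BYTES = 32


def transmit_dma_read(ram, address, queue_base, queue_size, stale_once=()):
    """Read one descriptor in 32-byte bursts.

    Returns (descriptor, bus_reads), or (None, bus_reads) when the
    transfer is aborted because the address is outside the send queue.
    """
    counter = DESCRIPTOR_BYTES // BURST_BYTES   # Step 802: count of 2
    pending_stale = set(stale_once)             # modelled coherency errors
    buffer = bytearray()
    bus_reads = 0
    while counter > 0:                          # Step 814: loop until zero
        in_range = (queue_base <= address
                    and address + BURST_BYTES <= queue_base + queue_size)
        if not in_range:                        # Steps 804, 805: abort
            return None, bus_reads
        burst = ram[address:address + BURST_BYTES]   # Steps 806 to 810
        bus_reads += 1
        if address in pending_stale:            # Step 811: newer copy in cache
            pending_stale.discard(address)
            continue                            # discard and retry from Step 806
        buffer += burst
        counter -= 1                            # Step 812
        address += BURST_BYTES
    return bytes(buffer), bus_reads             # Step 816


ram = bytes(range(128))
clean = transmit_dma_read(ram, 64, 0, 128)
retried = transmit_dma_read(ram, 64, 0, 128, stale_once=[96])
aborted = transmit_dma_read(ram, 128, 0, 128)
```

The retried case takes three bus reads rather than two, which mirrors the re-arbitration described for a coherency error.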

The message interface 409 is 32 bits wide and sixteen
32-bit words constitute a 64-byte descriptor. The 64-byte
descriptor is transferred in batches of 32 bytes each. The
Xmit msg state machine 427 controls and manages the
interface 409. The Xmit CPU state machine asserts the signal
433 to indicate that the first 32 bytes have been successfully
transferred over the CPU bus 317 (Step 818, FIG. 14B); this
puts the Xmit msg state machine into Step 818 and starts the
transfer on the message interface. In Step 820, the Xmit msg
machine 427 resets burst/transfer counters 439 and initiates
the transfer over the message interface 409. In Step 820, the
transfer is initiated over the message interface 409 by
asserting the "transfer valid" (Tx_DATA_Valid) signal
indicating to the message engine 315 state machine 410 that
valid data is available on the bus 409. The transmit msg
machine 427 transfers 32 bits of data on every subsequent
clock until its burst counter in burst/transfer counter 439
reaches a value equal to eight, Step 822. The burst counter
in burst/transfer counter 439 is incremented with each 32-bit
word put on the message bus 409 by a signal on line 433.
When the burst count is eight, a check is made by the state
machine 427 as to whether the transmit counter 431 has
expired, i.e., is zero, Step 824. The expiry of the transfer
counter in burst/transfer counter 439 indicates the 64-byte
descriptor has been transferred to the transmit buffer 422 in
message engine 315. If it has expired, the transmit message
state machine 427 proceeds to Step 826. In Step 826, the
Xmit msg state machine asserts the output End of Transfer
(Tx_EOT) indicating the end of transfer over the message
bus 409. In this state, after the assertion of the Tx_EOT
signal, the status of the transfer captured in the status register
404 is sent to the message engine 315 state machine 410.
The DMA operation is complete with the descriptor being
stored in the transmit buffer 422 (FIG. 7).

On the other hand, if the transfer counter in burst/transfer
counter 439 has not expired, the process goes to Step 800
and repeats the above described procedure to transfer the 2nd
32 bytes of descriptor data, at which time the transfer will be
complete.

Referring now to FIG. 10, the receive DMA 420 is
shown. Here, a message received from another director is to
be written into the RAM 312 (FIG. 7). The receive DMA
420 is adapted to handle three types of information: error
information which is 8 bytes in size; acknowledgement
information which is 16 bytes in size; and receive message
payload and/or fabric management information which is 32
bytes in size. Referring also to FIG. 7, the message engine
315 state machine 410 asserts the Rx_WE signal, indicating
to the Receive DMA 420 that it is ready to transfer the data in
its Rec buffer 416 (FIG. 7). The data in the Receive buffer could
be the 8-byte error information, the 16-byte Acknowledgement
information or the 32-byte Fabric management/Receive
message payload information. It places a 2-bit
encoded receive transfer count on the Rx_transfer count
signal indicating the type of information and an address
which is the address where this information is to be stored
in the receive queue of RAM 312. In response to the receive
write enable signal Rx_WE, the Receive message machine
450 (FIG. 10) loads the address into the address register 452
and the transfer count, indicating the type of information,
into the receive transfer counter 454. The address loaded
into the address register 452 is checked by the address check
circuitry 456 to see if it is within the range of the Receive
queue addresses in the RAM 312. This is done by checking
the address against the values loaded into the memory
registers 457 (i.e., a base address register and an offset
register therein). The base address register contains the start
address of the receive queue 312R residing in the RAM 312
and the offset register contains the size of this receive queue
312R in RAM 312. Therefore the additive sum of the values
stored in the base address register and the offset register
specifies the range of addresses of the receive queue 312R in
the RAM 312. The memory registers 457 are loaded during
initialization. On the subsequent clock after the assertion of
the Rx_WE signal, the message engine 315 state machine
410 then proceeds to place the data on a 32-bit message
engine 315 data bus 407, FIG. 10. A Rx_data_valid signal
accompanies each 32 bits of data, indicating that the data on
the message engine data bus 407 is valid. In response to this
Rx_data_valid signal the receive message state machine
450 loads the data on the data bus into the receive buffer
register 460. The end of the transfer over the message engine
data bus 407D is indicated by the assertion of the Rx_EOT
signal, at which time the Receive message state machine 450
loads the last 32 bits of data on the message engine data bus
407D of bus 407 into the receive buffer registers 460. This
signals the end of the transfer over the message engine data
bus 407D portion of bus 407. The end of such transfer is
conveyed to the Rx_CPU state machine 462 by the assertion
of the signal 464. The Receive CPU machine 462 now
requests the CPU bus 317 by asserting the signal
REC_Br. After an arbitration by CPU bus arbiter 414 (FIG.
7) the receive DMA 420 (FIG. 10) is given access to the
CPU bus 317. The Receive CPU state machine 462 proceeds
to transfer the data in its buffer registers 424 over the CPU
bus 317 into the Receive queue 312R in the RAM 312.
Simultaneously, this data is also transferred into a duplicate
buffer register 466. The data at the output of the receive
buffer register 460 passes to one input of a selector 470 and
also passes to a duplicate data receive buffer register 466.
The output of the duplicate receive buffer register 466 is fed
to a second input of the selector 470. As the data is being
transferred by the Receive CPU state machine 462, it is also
checked for cache coherency errors. If the data corresponding
to the address being written into the RAM 312 is located
in the CPU's local cache memory 319 (FIG. 7), the receive
DMA machine 420 waits for the CPU 310 to copy the old
data in its local cache memory 319 back to the receive queue
312R in the RAM 312 and then overwrites this old data with
a copy of the new data from the duplicate buffer register 466.

More particularly, if central processing unit 310 indicates
to the DMA receiver 420 that the data in the receive buffer
register 460 is available in the local cache memory 319, the
receive CPU state machine 462 produces a select signal on
line 463 which couples the data in the duplicate buffer
register 466 to the output of selector 470 and then to the bus
317 for storage in the random access memory 312.

The successful write into the RAM 312 completes the DMA
transfer. The receive DMA 420 then signals the message
engine 315 state machine 410 on the status of the transfer.
The status of the transfer is captured in the status register
459.

Thus, with both the receive DMA and the transmit DMA,
there is a checking of the local cache memory 319 to
determine whether it has "old" data, in the case of the
receive DMA, or whether it has "new" data, in the case of the
transmit DMA.

Referring now to FIG. 15A, the operation of the receive
DMA 420 is shown. Thus, in Step 830 the Receive message
machine 450 checks if the write enable signal Rx_WE is
asserted. If found asserted, the receive DMA 420 proceeds
to load the address register 452 and the transfer counter 454.
The value loaded into the transfer counter 454 determines
the type of DMA transfer requested by the Message engine
state machine 410 in FIG. 7. A check is then made as to
whether the Rx_DATA_VALID signal is asserted. If asserted,
it proceeds to Step 836. The Rx msg state machine loads the
buffer register 460 (FIG. 10) in Step 836 with the data on the
message engine data bus 407D of bus 407, FIG. 10. The
Rx_DATA_VALID signal accompanies each piece of data
put on the bus 407. The data is sequentially loaded into the
buffer registers 460 (FIG. 10). The end of the transfer on the
message engine data bus 407D of bus 407 is indicated by the
assertion of the Rx_EOT signal. When the Receive message
state machine 450 is in the end of transfer state, Step 840, it
signals the Receive CPU state machine 462 and this starts
the transfer on the CPU bus 317 side.

The flow for the Receive CPU state machine is explained
below. Thus, referring to FIG. 15B, the end of the transfer
on the Message engine data bus 407D portion of bus 407
starts the Receive CPU state machine 462 and puts it in Step
842. The Receive CPU state machine 462 checks for validity
of the address in this state (Step 844). This is done by the
address check circuitry 456. If the address loaded in the
address register 452 is outside the range of the receive queue
312R in the RAM 312, the transfer is aborted and the status
is captured in the Receive status register 459 and the Rec
CPU state machine 462 proceeds to Step 845. On a valid
address the Receive CPU state machine 462 goes to Step
846. In Step 846 the Receive CPU state machine 462 requests
access to the CPU bus 317. It then proceeds to Step 848.
In Step 848 it checks for a grant on the bus 317. On a
qualified grant it proceeds to Step 850. In Step 850, the Rec
CPU state machine 462 performs an address and a data cycle,
which essentially writes the data in the buffer registers 460
into the receive queue 312R in RAM 312. Simultaneously
with the write to the RAM 312, the data put on the CPU bus
317 is also loaded into the duplicate buffer register 466. At
the same time, the CPU 310 also indicates on one of the
control lines if the data corresponding to the address written
to in the RAM 312 is available in its local cache memory 319.
At the end of the address and data cycle the Rec CPU state
machine 462 proceeds to Step 850. In this step it checks for
cache coherency errors of the type described above in
connection with the transmit DMA 418 (FIG. 9). If a cache
coherency error is detected, the receive CPU state
machine 462 proceeds to Step 846 and retries the transaction.
More particularly, the Receive CPU state machine 462 now
generates another address and data cycle to the previous
address and this time the data from the duplicate buffer 466
is put onto the CPU data bus 317. If there were no cache
coherency errors, the Receive CPU state machine 462
proceeds to Step 852 where it decrements the transfer counter
454 and increments the address in the address register 452.
The Receive CPU state machine 462 then proceeds to Step
854. In Step 854, the state machine 462 checks if the transfer
counter has expired, i.e., is zero. On a non-zero transfer
count the receive CPU state machine 462 proceeds to Step
844 and repeats the above described procedure until the
transfer count becomes zero. A zero transfer count when in
Step 854 completes the write into the receive queue 312R in
RAM 312 and the Rec CPU state machine proceeds to Step 845.
In Step 845, the status stored in the status register is
conveyed back to the message engine 315 state
machine 410.
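
The receive-side loop (Steps 844 through 854) can be sketched in the same way. This is an illustrative model with several labeled assumptions: the transfer counter is taken to count 8-byte words (the text gives only a 2-bit encoded count), the CPU write-back of old cached data is simulated with filler bytes, and all names are hypothetical. The three information sizes (8, 16 and 32 bytes) are from the text.

```python
WORD_BYTES = 8   # assumption: one 64-bit CPU bus word per data cycle
INFO_BYTES = {"error": 8, "acknowledgement": 16, "message": 32}


def receive_dma_write(ram, address, queue_base, queue_size, data, kind,
                      cached=()):
    """Write received data into the receive queue region of `ram`.

    Returns ("ok", cycles) or ("aborted", cycles).
    """
    if len(data) != INFO_BYTES[kind]:
        raise ValueError("data length does not match the information type")
    counter = len(data) // WORD_BYTES            # models transfer counter 454
    pending = set(cached)                        # addresses held in CPU cache
    offset = 0
    cycles = 0
    while counter > 0:                           # Step 854: loop until zero
        in_range = (queue_base <= address
                    and address + WORD_BYTES <= queue_base + queue_size)
        if not in_range:                         # Step 844: abort
            return "aborted", cycles
        word = data[offset:offset + WORD_BYTES]
        duplicate = word                         # models duplicate buffer 466
        ram[address:address + WORD_BYTES] = word       # Step 850 write
        cycles += 1
        if address in pending:                   # cache coherency error
            pending.discard(address)
            # Simulated CPU write-back of old cached data (filler bytes).
            ram[address:address + WORD_BYTES] = b"\xff" * WORD_BYTES
            ram[address:address + WORD_BYTES] = duplicate   # retry write
            cycles += 1
        counter -= 1                             # Step 852
        address += WORD_BYTES
        offset += WORD_BYTES
    return "ok", cycles


ram = bytearray(64)
ack = bytes(range(16))
result = receive_dma_write(ram, 16, 0, 64, ack, "acknowledgement")
retry = receive_dma_write(ram, 32, 0, 64, ack, "acknowledgement", cached=[32])
abort = receive_dma_write(ram, 64, 0, 64, bytes(8), "error")
```

In the retried case the final contents of the queue are the new data, because the duplicate copy is written after the modelled write-back.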

Referring again to FIG. 7, the interrupt control status
register 412 will be described in more detail. As described
above, a packet is sent by the packetizer portion of the
packetizer/de-packetizer 428 to the crossbar switch 320 for
transmission to one or more of the directors. It is to be noted
that the packet sent by the packetizer portion of the
packetizer/de-packetizer 428 passes through a parity
generator PG in the message engine 315 prior to passing to the
crossbar switch 320. When such packet is sent by the
message engine 315 in exemplary director 180₁ to the
crossbar switch 320, a parity bit is added to the packet by
parity bit generator PG prior to passing to the crossbar
switch 320. The parity of the packet is checked in the parity
checker portion of a parity checker/generator (PG/C) in the
crossbar switch 320. The result of the check is sent by the
PG/C in the crossbar switch 320 to the interrupt control
status register 412 in the director 180₁.

Likewise, when a packet is transmitted from the crossbar
switch 320 to the message engine 315 of exemplary director
180₁, the packet passes through a parity generator portion of
the parity checker/generator (PG/C) in the crossbar switch
320 prior to being transmitted to the message engine 315 in
director 180₁. The parity of the packet is then checked in the
parity checker portion of the parity checker (PC) in director
180₁ and the result (i.e., status) is transmitted to the status
register 412.
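
The parity generation and checking performed on each hop between the message engine and the crossbar switch can be sketched in software as follows. This is a minimal illustration under stated assumptions: the description does not say whether even or odd parity is used, so even parity is assumed, and the function names and the dictionary standing in for the status register 412 are hypothetical.

```python
from typing import Dict, Tuple


def parity_bit(packet: bytes) -> int:
    """Return the even-parity bit for the packet (even parity is an assumption)."""
    ones = sum(bin(byte).count("1") for byte in packet)
    return ones & 1  # 1 when the count of set bits is odd


def add_parity(packet: bytes) -> Tuple[bytes, int]:
    """Model of the parity generator: attach a parity bit before sending."""
    return packet, parity_bit(packet)


def check_parity(packet: bytes, bit: int) -> bool:
    """Model of the parity checker: True when the packet matches its bit."""
    return parity_bit(packet) == bit


def receive(packet: bytes, bit: int, status_register: Dict[str, bool]) -> None:
    """Record the result of the check in a stand-in for the status register."""
    status_register["parity_error"] = not check_parity(packet, bit)
```

A single parity bit of this kind detects any odd number of flipped bits in the packet but cannot detect an even number of flips, and it does not identify which bit was corrupted.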

A number of embodiments of the invention have been 
described. Nevertheless, it will be understood that various 
modifications may be made without departing from the spirit
and scope of the invention. Accordingly, other embodiments 
are within the scope of the following claims. 
What is claimed is: 

1. A system interface comprising: 
a plurality of first directors; 

a plurality of second directors; 

a data transfer section having a cache memory, such cache 
memory being coupled to the plurality of first and 
second directors; 

a messaging network, operative independently of the data 
transfer section, coupled to the plurality of first direc- 
tors and the plurality of second directors; and 

wherein the first and second directors control data transfer 
between the first directors and the second directors in 
response to messages passing between the first direc- 
tors and the second directors through the messaging 
network to facilitate data transfer between first direc- 
tors and the second directors with such data passing 
through the cache memory in the data transfer section; 

wherein each one of the first directors includes: 

a data pipe coupled between an input of such one of the
first directors and the cache memory;
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, 
the microprocessor, and the controller; and wherein 

the controller controls the transfer of the messages 
between the message network and such one of the 
first directors and the data between the input of such 
one of the first directors and the cache memory. 

2. The system interface recited in claim 1 wherein each 
one of the second directors includes: 

a data pipe coupled between an input of such one of the
second directors and the cache memory;
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, the 
microprocessor, and the controller; and wherein 
the controller controls the transfer of the messages 
between the message network and such one of the 
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

3. The system interface recited in claim 1 wherein each 
one of the controllers includes a bus arbiter coupled to the
common bus for arbitrating access to such common bus. 

4. The system recited in claim 2 wherein each one of the 
controllers in the second directors includes a bus arbiter 
coupled to the common bus for arbitrating access to such 
common bus. 

5. A data storage system for transferring data between a 
host computer/server and a bank of disk drives through a
system interface, such system interface comprising: 

a plurality of first directors coupled to host computer/ 
server; 

a plurality of second directors coupled to the bank of disk 
drives; 

a data transfer section having a cache memory, such 
cache memory being coupled to the plurality of first 
and second directors; 

a messaging network, operative independently of the 
data transfer section, coupled to the plurality of first 
directors and the plurality of second directors; and 

wherein the first and second directors control data 
transfer between the host computer and the bank of 
disk drives in response to messages passing between
the first directors and the second directors through 
the messaging network to facilitate the data transfer 
between host computer/server and the bank of disk 
drives with such data passing through the cache
memory in the data transfer section; 
wherein each one of the first directors includes: 
a data pipe coupled between an input of such one of
the first directors and the cache memory;
a microprocessor;
a controller; 

a common bus, such bus interconnecting the data 
pipe, the microprocessor, and the controller; and 
wherein 

the controller controls the transfer of the messages
between the message network and such one of the
first directors and the data between the input of
such one of the first directors and the cache
memory.

6. The storage system recited in claim 5 wherein each one
of the second directors includes: 

a data pipe coupled between an input of such one of the
second directors and the cache memory;
a microprocessor;
a controller;
a common bus, such bus interconnecting the data pipe, the 
microprocessor, and the controller; and wherein 
the controller controls the transfer of the messages 
between the message network and such one of the
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

7. The storage system recited in claim 5 wherein each one
of the controllers includes a bus arbiter coupled to the
common bus for arbitrating access to such common bus. 

8. The storage system recited in claim 6 wherein each one 
of the controllers in the second directors includes a bus arbiter
coupled to the common bus for arbitrating access to such 
common bus.

9. A method of operating a system interface having a 
plurality of first directors, a plurality of second directors and 
a data transfer section having a cache memory, such cache 
memory being coupled to the plurality of first and second 
directors, such method comprising: 
providing a messaging network, operative independently
of the data transfer section, coupled to the plurality of 
first directors and the plurality of second directors to 
control data transfer between the first directors and the 
second directors in response to messages passing 
between the first directors and the second directors 
through the messaging network to facilitate data trans- 
fer between first directors and the second directors with 
such data passing through the cache memory in the data 
transfer section; and providing each one of the first 
directors with: 

a data pipe coupled between an input of such one of the
first directors and the cache memory;
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, 
the microprocessor, and the controller; and wherein 

the controller controls the transfer of the messages 
between the message network and such one of the 
first directors and the data between the input of such 
one of the first directors and the cache memory. 

10. The method recited in claim 9 including providing 
each one of the second directors with: 

a data pipe coupled between an input of such one of the
second directors and the cache memory;
a microprocessor; 
a controller; 

a common bus, such bus interconnecting the data pipe, the 
microprocessor, and the controller; and wherein 
the controller controls the transfer of the messages 
between the message network and such one of the 
second directors and the data between the input of 
such one of the second directors and the cache 
memory. 

11. The method recited in claim 9 including providing 
each one of the controllers with a bus arbiter coupled to the 
common bus for arbitrating access to such common bus. 

12. The method recited in claim 10 including providing 
each one of the controllers in the second directors with a bus
arbiter coupled to the common bus for arbitrating access to 
such common bus. 

***** 